This proposal is part of "A Dozen Visions for Wikitext".
This is some text from the English Wikipedia article titled Elevator:
An elevator is a machine that vertically transports people or freight between levels. They are typically powered by electric motors that drive traction cables and counterweight systems such as a hoist, although some pump hydraulic fluid to raise a cylindrical piston like a jack.
If you speak British English, of course, this isn’t an “elevator” at all. This is a Lift:
An elevator (American English, also in Canada) or lift (Commonwealth English except Canada) is a machine that vertically transports people or freight between levels. They are typically powered by electric motors that drive traction cables and counterweight systems such as a hoist, although some pump hydraulic fluid to raise a cylindrical piston like a jack.
American and British English vary on a number of terms and pronunciations, which co-exist somewhat uneasily on English Wikipedia. And there are further variants found in Indian English, Kenyan English, and other places around the world. There is a long policy on Wikipedia about where and how to use regional terms. But, with a few exceptions, the different variants are mutually intelligible.
As it turns out, Serbian sides with the Brits. This is Lift on Serbian Wikipedia:
Lift služi za vertikalni ili strmi prevoz osoba i materijala, pa se i deli na putničke ili teretne liftove. Za prevoz (transport) se koriste otvorene ili zatvorene kabine s mogućnošću zaustavljanja na potrebnom broju stanica. Primenjuju se u stambenim i poslovnim zgradama, rudnicima, industrijskim postrojenjima, brodovima, na gradilištima i slično.
But as it turns out, there are multiple ways to write Serbian. This is the same article, but written in Cyrillic script, Лифт:
Лифт служи за вертикални или стрми превоз особа и материјала, па се и дели на путничке или теретне лифтове. За превоз (транспорт) се користе отворене или затворене кабине с могућношћу заустављања на потребном броју станица. Примењују се у стамбеним и пословним зградама, рудницима, индустријским постројењима, бродовима, на градилиштима и слично.
LanguageConverter allows you to switch between these views of the article, depending on which script you prefer to read. As I understand it, many people in Serbia can read both. There are also spelling and vocabulary differences, orthogonal to the script choice.
This is the same article on Chinese Wikipedia, 電梯, as you would read it if you lived in Taiwan, Hong Kong, or Macau:
電梯(英語:Elevator),其他地區亦稱升降機。在香港、新加坡和馬來西亞俗稱「𨋢」(英語lift的譯音),是一種垂直運送行人或貨物的運輸工具。據統計,2002年全球電梯總數超過600萬部,是現代使用最多的垂直運輸工具。
It is written with what are called “traditional” characters.
But if you’re on the mainland, the title and appearance of the article are different, 电梯:
电梯(英语:Elevator),其他地区亦称升降机。在香港、新加坡和马来西亚俗称“䢂”(英语lift的译音),是一种垂直运送行人或货物的运输工具。据统计,2002年全球电梯总数超过600万部,是现代使用最多的垂直运输工具。
Compare the first character of the titles carefully: 電 versus 电. Mainland China uses “simplified” characters, which the People’s Republic of China government introduced in 1956 and 1964 in an attempt to promote literacy.[1]
Unlike the situation with Serbian, few people are equally fluent in both traditional and simplified characters. Thousands of characters differ between the two orthographies, although not all characters change. (The second character in the article title, 梯, does not change, for instance.)
This is the Hindi article, written in Devanagari script, उत्थापक:
उत्थापक, उच्चालित्र अथवा एलिवेटर (lift या elevator) एक युक्ति है वस्तुओं एवं व्यक्तिओं को उर्ध्व दिशा में चढ़ाने-उतारने के काम आती है। प्रायः किसी बहुमंजिला ऊँचे भवन, जलपोत एवं अन्य संरचनाओं में उत्थापक लगा होता है जो गोलों को या सामान आदि को एक मंजिल से दूसरी मंजिल या एक स्तर से दूसरे स्तर पर लाता और ले जाता है। उत्थापक प्रायः विद्युत मोटर द्वारा चलते हैं।
And this is the article on elevators in Urdu, رافعہ:
انتصابی نقل و حمل کی کل۔ جدید عمارتوں، جہازوں اور کانوں میں استعمال ہونے والی تمام کھلی اور بند ساختوں اور لگاتار چلنے والے ان پٹوں کو بھی رافع یا (انگریزی:Elevator/Lift) کہا جاتا ہے جو بھاری چیزوں کو ایک جگہ سے دوسری جگہ پہنچاتے ہیں۔ طاقت سے چلنے والے رافع جو عام طور پر بھاپ سے کام کرتے تھے انیسویں صدی عیسوی سے استعمال ہو رہے تھے جبکہ اس صدی کے اواخر میں برقی رافعہ عام ہو گیا۔
Urdu is written in Arabic script, right to left. Urdu and Hindi are different dialects of the Hindustani language, written in very different scripts: Arabic script on the Pakistani side of the border, Devanagari script on the Indian side.[2]
Punjabi is a similar case, with two smaller wikis split by the India-Pakistan border. Four scripts are used to write Punjabi: Shahmukhi (closely related to Urdu script) and Gurmukhī are the ones most commonly used and are considered the official scripts of the language, but Devanāgarī and Latin scripts are occasionally used as well.
As I mentioned earlier, wikitext has a feature called LanguageConverter that automatically converts between words in different dialects and between different scripts for the same language and, if you squint a bit, is used to translate between languages as well.[3] It is currently in use on 15 wikis, requested on more, and probably relevant to most of our wikis to some degree.[4]
It lets you write markup like this to define word or script variants, and then those are automatically applied:
-{en-us:elevator; en-gb:lift}-
And there is markup to prevent the conversion in places where it isn't appropriate:
Please -{lift}- those boxes and place them in the lift.
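To make the two markup forms above concrete, here is a minimal Python sketch of how such markup might be processed. It is a toy model, not the MediaWiki implementation: the function names `parse_rule` and `convert` are hypothetical, the treatment of an inline rule as applying to the whole text simply follows the simplified description above, and the real LanguageConverter syntax has more forms than this.

```python
import re

# Matches -{ ... }- spans (non-greedy, no nesting; a simplifying assumption).
MARKUP = re.compile(r"-\{(.*?)\}-")

def parse_rule(body):
    """Return {variant: text} for 'en-us:elevator; en-gb:lift'.

    Returns None when the body has no colon, which this sketch treats
    as protected text that must not be converted.
    """
    if ":" not in body:
        return None
    rule = {}
    for part in body.split(";"):
        if part.strip():
            variant, _, text = part.partition(":")
            rule[variant.strip()] = text.strip()
    return rule

def convert(text, variant):
    """Render text for one variant (toy model of the markup above)."""
    # Pass 1: collect every rule defined anywhere in the text.
    rules = [r for r in (parse_rule(m.group(1)) for m in MARKUP.finditer(text)) if r]

    def convert_plain(chunk):
        # Replace each known alternative with the word for this variant.
        for rule in rules:
            target = rule.get(variant)
            if target is None:
                continue
            for other in rule.values():
                if other != target:
                    chunk = re.sub(rf"\b{re.escape(other)}\b", lambda _m: target, chunk)
        return chunk

    # Pass 2: convert plain text, expand rules, pass protected spans through.
    out, pos = [], 0
    for m in MARKUP.finditer(text):
        out.append(convert_plain(text[pos:m.start()]))
        rule = parse_rule(m.group(1))
        if rule is None:
            out.append(m.group(1))  # protected: emit verbatim
        else:
            out.append(rule.get(variant, next(iter(rule.values()))))
        pos = m.end()
    out.append(convert_plain(text[pos:]))
    return "".join(out)

sample = ("The -{en-us:elevator; en-gb:lift}- is broken. "
          "Please -{lift}- those boxes and place them in the lift.")
us_text = convert(sample, "en-us")
gb_text = convert(sample, "en-gb")
```

For an American reader, the noun "lift" at the end becomes "elevator" while the protected verb "lift" is left alone; for a British reader both stay as "lift".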
This is a brief summary of more complete talks I've given in the past (MediaWiki Developer Summit 2015, Wikimania 2017, Wikimedia Language Engineering Community Meeting May 2024, Wikimania 2024). In short, LanguageConverter is missing a number of features, including glossary support and a means to mark the source variant of a text. The markup used for LanguageConverter is cumbersome (although glossaries would help) and may benefit from annotations in order to move some of the markup out of the text.
Maintenance of LanguageConverter is also made difficult by the current architecture, which requires someone with the combined skills of a linguist and a PHP regular expressions expert in order to create or modify the converter for a language or script pair. In order to safely make changes to LanguageConverter engines, the canary system used by Parsoid could be applied to evaluate a proposed change on a large set of articles. With the aid of such a system to evaluate changes to LanguageConverter, we could consider moving some or all of the engines to a more maintainable system, for example the rule-based system used by libicu, which expresses transformations as rules in a form familiar to linguists.[5]
However, the biggest drawback of LanguageConverter can be seen by simply attempting to edit an article on Serbian Wikipedia, for example Хемијски елемент:
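As a rough illustration of those two ideas together, the Python sketch below expresses a Serbian Cyrillic-to-Latin converter as a plain rule table and then compares two versions of it over a small corpus, canary-style. Everything here is an assumption made for illustration: `make_engine`, `canary_diff`, the sample corpus, and the "proposed" rule change are hypothetical, and the table is neither libicu's rule syntax nor the rules LanguageConverter actually ships.

```python
# Rule table: one Serbian Cyrillic letter -> Latin letter or digraph.
CYR = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
SR_RULES = dict(zip(CYR, LAT))

def make_engine(rules):
    """Build a converter function from a lowercase rule table."""
    def convert(text):
        out = []
        for ch in text:
            lat = rules.get(ch.lower())
            if lat is None:
                out.append(ch)                # not covered by a rule: keep as-is
            elif ch.isupper():
                out.append(lat.capitalize())  # Љ -> Lj (all-caps words not handled)
            else:
                out.append(lat)
        return "".join(out)
    return convert

def canary_diff(old_engine, new_engine, corpus):
    """Return (input, old output, new output) wherever two engines disagree."""
    report = []
    for text in corpus:
        before, after = old_engine(text), new_engine(text)
        if before != after:
            report.append((text, before, after))
    return report

current = make_engine(SR_RULES)
# Hypothetical rule change under review: a different mapping for џ.
proposed = make_engine({**SR_RULES, "џ": "dz"})

corpus = ["Лифт служи за вертикални или стрми превоз", "џеп"]
report = canary_diff(current, proposed, corpus)
```

Here `report` lists only the second sample, the one containing џ, so a reviewer sees exactly which texts a rule change would affect before it is deployed. A real canary run would do the same over a large set of articles.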
== Списак 118 познатих хемијских елемената ==
Следећа табела садржи 118 познатих хемијских елемената.
* '''Атомски број''', '''име''', и '''симбол''' служе независно као јединствени идентификатори.
* '''Имена''' су она која су прихваћена од стране -{[[Међународна унија за чисту и примењену хемију|IUPAC]]}-; провизиона имена за недавно произведене елементе који нису формално именовани су дата у заградама.
* '''Група, периода,''' и '''блок''' се односе на позицију елемента у [[периодни систем|периодном систему]]. Бројеви група су у тренутно званично прихваћеној нотацији; за старије алтернативне нотације погледајте [[Група (периодни систем)|Група периодног система елемената]].
* '''Stanje materije''' ''(Čvrsto, tečno,'' ili ''gasovito)'' se odnosi na standardne uslove [[temperatura|temperature]] i [[pritisak|pritiska]] ([[Стандардни услови за температуру и притисак|STP]]).
* '''Pojavljivanje''' pravi razliku između elemenata koji se javljaju u prirodi, kategorisane kao bilo ''Praiskonski'' ili ''Prolazni'' (u smislu raspada), i ''Sintetički'' elementi koji su proizvedeni tehnološkim putem, i nisu prirodno poznati.
* '''Opis''' sumira svojstva elementa koristeći opširne kategorije koje su prisutne u periodnom sistemu: [[aktinoid]], [[alkalni metal]], [[Земноалкални метал|zemnoalkalni metal]], [[halogen]], [[lantanoidi|lantanoid]], [[metal]], [[metaloid]], [[plemeniti gas]], [[nemetal]], i [[Prelazni metali|prelazni metal]].
Notice that there's a mix of Cyrillic script at the top and Latin script at the bottom. Luckily, most users of Serbian Wikipedia can read and write both scripts for their language. But this isn’t the case for users of most of our other language variants, who can comfortably read and write only one of the scripts for their language. This can be a huge barrier to editors, who can read the article (as output from LanguageConverter in their preferred script) but can’t read or write it once the editor starts up and the script changes. This is also a barrier to the adoption of LanguageConverter on several wikis which could use it but which fear fragmenting their editor base.
Native Script Editing (phab:T17161, phab:T113002, phab:T87652) is a feature proposal which uses Parsoid’s selective serialization support to allow editing the article entirely in the user's preferred variant, with transparent conversion so that we don’t “dirty diff” unedited content.
To drive this home: LanguageConverter is used on Chinese Wikipedia. Chinese has a billion speakers, most of whom are not biliterate and cannot generally read both simplified and traditional characters. Lack of Native Script Editing blocks editor growth on Chinese Wikipedia.
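The "no dirty diff" property of Native Script Editing can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Parsoid's actual mechanism: the page is treated as a list of independent segments, segments are neither added nor removed during the edit, and `open_for_editing`, `save`, and the small Cyrillic-to-Latin table are hypothetical names introduced for illustration.

```python
# Minimal Serbian Cyrillic -> Latin table (illustrative only).
CYR = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
TABLE = str.maketrans(dict(zip(CYR, LAT)))
TABLE.update(str.maketrans({c.upper(): l.capitalize() for c, l in zip(CYR, LAT)}))

def to_latin(text):
    return text.translate(TABLE)

def open_for_editing(stored_segments, convert):
    """Show every stored segment in the editor's preferred script."""
    return [convert(segment) for segment in stored_segments]

def save(stored_segments, edited_segments, convert):
    """Selective serialization: only segments the user changed are rewritten."""
    saved = []
    for original, edited in zip(stored_segments, edited_segments):
        if edited == convert(original):
            saved.append(original)  # untouched: keep stored source byte-for-byte
        else:
            saved.append(edited)    # changed: store what the user wrote
    return saved

# Stored page mixes scripts, like the Serbian example above.
stored = [
    "Следећа табела садржи 118 познатих хемијских елемената.",
    "Stanje materije se odnosi na standardne uslove.",
]

view = open_for_editing(stored, to_latin)  # editor sees Latin throughout
view[1] = "Stanje materije se odnosi na standardne uslove temperature."
saved = save(stored, view, to_latin)
```

The editor works entirely in Latin script, yet the first segment is saved in its original Cyrillic form because it was never touched; only the edited second segment changes in the stored page.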
zh-hk ("Chinese as spoken in Hong Kong") converter relates to the yue (Cantonese) language.