These are notes from a teleconference discussion the parsing team had on "Resilience", which is one of the theme areas for the 2018 WMF Technical Conference. Apologies for the loose structure.
The broader topic area is "Scale": identify where in the wiki lifecycle each wiki is, and try to adapt to that:
- Retain editors in editor-decline phase
- Create new content for young wikis
Mako's research re: lifecycles, confirming research on Wikia, etc.
- seems genuinely part of social process, not an artifact of chronological time ("in 2005 people stopped editing things online") or software tools ("in 2005 the wiki editor broke")
- Genuinely useful to think hard about what it means for the WMF to explicitly strategize with wikis in different parts of the cycle.
- ...but even enwiki isn't uniformly in the "late wiki" phase; there are regions ("articles geolocated on the African continent") and communities ("women scientists") that are still in the "not enough content" phase.
Further thoughts:
"What is the greatest potential threat...?"
- also wiki-lifecycle specific? i.e.:
- editor intimidation (for mature wikis with few editors)
- content barriers (for young wikis lacking content)
- but is enwiki really a "mature" wiki?
"Resilience vs resistance to change?"
- another way of looking at the wiki lifecycle (which theorizes that wiki communities become resistant to change once their "treasure" of good content builds up)
"Served in a data-efficient, useful, and reliable manner"
- You'll get different technological solutions here based on whether you believe:
- The wiki already has all the valuable content and the challenge is getting it to users (IPFS-style schemes tend toward this assumption)
- The hardest part is getting new content *from* users
- safely, or easily/low-friction
- The greatest challenge is keeping readers *up to date* with the rate of change of a quickly growing wiki
- You might see all three regimes in a single wiki, eg
- Medical information in English
- Articles about low-bandwidth regions of the world
- Articles on olympics results, or breaking news stories
Language translation technology is one approach to bridging gaps between mature and less-mature wikis. The low-content wiki can get translations of general-interest articles while contributing back articles on its particular area of the world/culture/etc.
- But then you have to solve two "wiki lifecycle" issues at once!
- Sometimes you also need to solve linguistic issues
- developing language models w/o good access to native speakers
- languages which are censored/limited in their home regions
- content which is censored/limited in a language region
- script/writing system/vocabulary issues
- We separate wikis by language, but this is sometimes a poor proxy for "nationality", "legal regime", or "culture"
- although sometimes we exploit this gap on purpose
- eg to allow "high freedom" content to be read in "low freedom" neighboring regions, either directly or via translation.
(From Arlo): Think of resilience like "health": an organism can only sustain so much stress before it starts to decline.
- Identify stressors
- Quantify stressors
- Alleviate stressors
Useful framing question: resilience against what?
- decline? (but all things die, and sometimes that's good)
- "premature decline"? (and "old wiki" community w/o a lot of content?)
- gradual corruption?
- external threat? (but what?)
- natural disaster?
- legal challenge?
- cultural/social shifts?
- ...?
It can be useful just to enumerate these threats, to determine whether there are common strategies that combat many of them, instead of dealing with each individually.
We need to hold "centralized" and "decentralized" in balance.
Decentralization increases resilience but harms scale, and vice-versa.
Concretely:
1. Global templates
- Our project isn't just content, it's also process and social models.
- Global templates allow us to export & share workflow
- of course, workflows of "big wikis" may not be appropriate for "small wikis" and vice-versa. But we can still share among similar wikis
- The big technical issue is always translation (see the sketch after this list):
- template names
- template parameters
- template documentation (closest to being solved)
- code & comments in code-heavy modules (Scribunto)
- More abstractly, translation between communities.
- Loosely-coupled wikis allow improvisation and innovation w/in constraints
- Global templates increase dependencies, further entangling wikis, and impairing resilience (in the "decentralized/federated" sense)
- We don't have good dependency-management tools
- We don't have good fork management tools
- In theory, we wouldn't need global templates at all if we have really good tools to manage and synchronize forks.
- github doesn't have any way to mark a repository as the "centralized authoritative source" of a codebase, for example.
- Equity issues: are we imposing en/de templates on the rest of the world, or will this be a place where en/de can listen to smaller wikis and learn new tricks?
- Also: different wikis have different cultural norms re who can edit templates. (Some of these rules are technical artifacts about the computational impact of changing certain templates.)
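To make the translation problem concrete, here's a minimal Python sketch of a purely hypothetical per-wiki alias table for a shared template: the canonical name and its parameters get mapped to local equivalents before rendering. The wiki keys, template names, and parameter names below are illustrative only, not an existing MediaWiki feature.

    # Hypothetical sketch: localizing a shared ("global") template invocation.
    # The alias table is the kind of per-wiki mapping a global-templates
    # system would have to maintain for template names and parameters.

    ALIASES = {
        # wiki -> (local template name, {canonical param -> local param})
        "enwiki": ("Citation needed",      {"date": "date",  "reason": "reason"}),
        "dewiki": ("Belege fehlen",        {"date": "Datum", "reason": "Grund"}),
        "frwiki": ("Référence nécessaire", {"date": "date",  "reason": "raison"}),
    }

    def localize(wiki: str, params: dict) -> str:
        """Render a canonical invocation as local wikitext for `wiki`."""
        local_name, param_map = ALIASES[wiki]
        parts = [local_name]
        for canonical_param, value in params.items():
            parts.append(f"{param_map[canonical_param]}={value}")
        return "{{" + "|".join(parts) + "}}"

    print(localize("dewiki", {"date": "October 2018", "reason": "unsourced"}))
    # -> {{Belege fehlen|Datum=October 2018|Grund=unsourced}}

Even this toy version shows why dependency and fork management get hard: every wiki's alias table is itself content that can drift out of sync with the shared template.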
2. Article templates.
- As opposed to "global templates", what is meant here are improvements to the fundamental template mechanism
- The parsing team has proposed/is working on quite a few ideas here, ranging from "we've got wip patches already" to more speculative future-of-the-platform stuff under our informal "wikitext 2.0" banner:
- Balanced templates, heredoc arguments, improved templatedata, semantics of transclusion, scribunto/js, visualeditor-for-templates, etc
- Many of the same issues apply re: global templates: translation challenges, dependency/fork management, continuing to allow local innovation, etc.
- In addition, the complexity of common templates, and of editing/authoring them, has often been mentioned.
- (See T114454 for one attempt at solving this issue)
- Many of our tools would like to have more semantic information about templates. TemplateData is a small step in this direction (see the sketch after this list).
- Our infrastructure would like to have tighter semantics for templates. (Edge or client-side composition, granular caching, etc)
- Generally: much of our UX is "content", created by the template mechanism:
- Inter language links (until recently)
- Infoboxes / sidebars
- Navboxes / footer information
- Image styling, on many wikis
- Workflow annotations, like {{citation needed}}, and many categories
- It is a testament to the brilliance of the template system that it can be extended so far, to achieve all these different tasks...
- ...but we should probably be investing in proper tools, in order to move from the template-enabled "innovation" phase to "production".
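To make the "semantic information" point concrete, here is a small Python sketch built around a TemplateData-style description. The JSON below approximates the real TemplateData format (language-keyed labels, typed parameters), but its exact shape should be treated as illustrative; the point is that once tools have this metadata they can localize documentation and validate invocations.

    import json

    # A TemplateData-style description of a template's parameters.
    # (Approximates the real TemplateData JSON; treat the field names
    # as illustrative rather than normative.)
    TEMPLATE_DATA = json.loads("""
    {
      "description": { "en": "Flags an unsourced claim.",
                       "de": "Markiert eine unbelegte Aussage." },
      "params": {
        "date":   { "label": { "en": "Date",   "de": "Datum" },
                    "type": "string", "required": false },
        "reason": { "label": { "en": "Reason", "de": "Grund" },
                    "type": "string", "required": false }
      }
    }
    """)

    def label_for(param: str, lang: str, fallback: str = "en") -> str:
        """Pick a human-readable label in the reader's language, if available."""
        labels = TEMPLATE_DATA["params"][param]["label"]
        return labels.get(lang, labels.get(fallback, param))

    def check_invocation(args: dict) -> list:
        """Return a list of problems with a template invocation."""
        problems = [f"unknown parameter: {name}"
                    for name in args if name not in TEMPLATE_DATA["params"]]
        problems += [f"missing required parameter: {name}"
                     for name, info in TEMPLATE_DATA["params"].items()
                     if info.get("required") and name not in args]
        return problems

    print(label_for("reason", "de"))             # -> Grund
    print(check_invocation({"raeson": "typo"}))  # -> ['unknown parameter: raeson']

This is also the kind of metadata the VisualEditor template dialog already leans on.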
3. Translation, including machine translation, as a means of increasing scale
- I've already written a lot about this:
I've already written about offline editing queues as a mechanism to enhance access from challenging areas (a sketch follows below):
- Edit Conflicts, Offline Contributions, and Tor: Oh my!
This would be a useful area in which to deploy prototypes and pursue active research, for example on privacy-preserving reputation systems. This could be done over the subset of Tor-using Wikimedians, so that "failed experiments" don't adversely impact the social processes of our larger community. Permission to fail!
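A minimal sketch of the offline editing queue idea, assuming only that each queued edit records the revision it was based on; the class and function names are invented for this sketch, not an existing API. Clean fast-forwards can be submitted automatically once connectivity returns, while anything else is routed through conflict resolution.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class QueuedEdit:
        page: str
        base_revid: int      # revision the editor saw when making the change
        new_text: str
        summary: str = ""

    @dataclass
    class OfflineQueue:
        """Edits made while offline, replayed when we're back online."""
        edits: List[QueuedEdit] = field(default_factory=list)

        def record(self, edit: QueuedEdit) -> None:
            self.edits.append(edit)

        def flush(self,
                  current_revid: Callable[[str], int],
                  submit: Callable[[QueuedEdit], None],
                  resolve_conflict: Callable[[QueuedEdit], None]) -> None:
            for edit in self.edits:
                if current_revid(edit.page) == edit.base_revid:
                    submit(edit)            # fast-forward: nobody else edited meanwhile
                else:
                    resolve_conflict(edit)  # concurrent edit: needs a merge / 3-way diff
            self.edits.clear()

The hard parts, of course, live in resolve_conflict, and in how a reputation system should treat edits that arrive hours or days after they were made.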
Note that there are three fundamental conflicts in play:
- Immutable signed/attested content -vs- "encyclopedia anyone can edit"
- and don't forget "right to be forgotten", libel laws, biography of living persons, DMCA takedown, vandalism, etc
- distributing content also means distributing the liability for content deemed outré in your particular legal regime
- Strong reputation system -vs- protecting identity of editors
- say, in repressive regimes, or against hate/bias/harassment
- "Every wiki is its own community" -vs- centralization and "scalability"
- eg global templates, sharing workflows, etc
- centralization also means agreeing on a single legal regime (but there may not be one single best regime)
Various "cryptocurrency"-themes proposals should be treated as high-risk proposals, on par with the way that (say) the idea that Wikipedia should use a github-like fork-and-merge model has been treated.
- Again, the Wikimedia project is not just the content, but a particular social model, which has been embedded both in the mediawiki codebase and in countless templates and policy pages on wiki
- Any shift in how edits are distributed or compensated, or how reputation is calculated, shifts the incentives in this social system.
Regarding content distribution:
- Anything wiki-specific will be blocked
- The only *practical* solution is to piggyback on something which countries "can't afford" to block
- That requires the killer app to come *before* the disputed content! (hard)
- The only current technology meeting this description is HTTPS — and that's why efforts like the EFF's HTTPS-Everywhere are valuable: you are relying on steadily increasing the cost of blocking HTTPS
- Tor is a close second here.
- Many ideas proposed in this space are counterproductive: by drawing attention to either WP or the underlying technology, they are most likely to get *both* blocked, rather than actually enhance access.
- Everyone who holds a copy of WP will likely be legally liable for everything in WP.
- If you make the copies opaque (encrypted, shared shards, etc) the authorities will probably just assume that your part has the stuff they don't like, and you can't prove otherwise.
- If you make it easy to hold only the parts of WP which are "safe", then you haven't done much to improve the distribution problem. The safe content isn't the stuff that is censored.
- To frame your thought experiments here, consider Xinjiang:
- https://twitter.com/HowellONeill/status/1046781271370690561
- What can actually help in this situation?
- An easier example is Cuba/NK: it seems clear that enhanced offline access and improved support (in LanguageConverter?) for the dialect would help. (But we're not doing that.)
- In this case (and many similar ones) there is already a robust samizdat data community based on a sneakernet of flash drives.
Regarding IPFS: https://twitter.com/cscottnet/status/1044241859676131330
- (I have lots of other thoughts about IPFS as well, dating back to July 2014.)
How to protect privacy
[These are my notes from a conversation with a Wikimedian -- I think Greg Maxwell -- at Grendel's in Harvard Square on Jan 15, 2017. They seem to be naturally related to the other ideas on this page.]
Complete offline copies completely protect anonymity and article history
Tor editing: token scheme, ip->blind token. Fixed factor increase.
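For concreteness, a toy sketch of the kind of blind-token scheme alluded to here, using textbook RSA blinding with deliberately tiny, insecure parameters (a real deployment needs a vetted cryptographic library and a real protocol). The idea: the issuer can rate-limit token issuance per IP, but cannot later link the unblinded token back to the IP that requested it.

    import hashlib
    import secrets

    # Toy RSA key for the token issuer (p=61, q=53).  Insecure by design;
    # for illustration only.
    N, E, D = 3233, 17, 413          # 17 * 413 % 780 == 1, where lcm(60, 52) == 780

    def H(msg: bytes) -> int:
        return int.from_bytes(hashlib.sha256(msg).digest(), "big") % N

    # --- client: generate a token and blind it ---
    token = secrets.token_bytes(16)
    while True:
        r = secrets.randbelow(N - 2) + 2
        try:
            r_inv = pow(r, -1, N)    # r must be invertible mod N (Python 3.8+)
            break
        except ValueError:
            continue
    blinded = (H(token) * pow(r, E, N)) % N

    # --- issuer: rate-limit by the requesting IP, then sign the *blinded* value ---
    blind_sig = pow(blinded, D, N)

    # --- client: unblind; (token, sig) is now unlinkable to the issuing IP ---
    sig = (blind_sig * r_inv) % N
    assert pow(sig, E, N) == H(token)   # anyone can verify with the public key (N, E)
    print("token verifies")

"Fixed factor increase" presumably refers to the abuse budget: each IP can only obtain a bounded number of such tokens, so opening up Tor editing multiplies abuse capacity by at most a fixed factor.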
99% of attention is on the trolls who target admins.
Compromised tools: looking at read history, associating users.
Pseudonyms are not effective if you have IPs.
Library checkout records and the Patriot Act.
Readers didn't need any privacy
An increase in vandalism goes along with an increase in good edits.
What kinds of threats are there? What can be revealed by what you're reading?
Jimmy knew that Wikipedia was being captured in 2005, based on Juniper docs.
Rants on wikimedia-l about privacy.
Detect interception: deliberately poll from places out on the internet and check that the (hash of the) session keys are the same, to detect MITM attacks. This will cause state actors not to attack, because they don't want to be detected. Solicit volunteers to be part of the "wikipedia security and privacy project".
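A sketch of the detection idea, comparing TLS certificate fingerprints rather than session keys (the certificate is what a simple client can easily observe); the hostname is just an example, and a real system would have to account for legitimate certificate rotation and multiple CDN endpoints.

    import hashlib
    import socket
    import ssl

    def cert_fingerprint(host: str, port: int = 443) -> str:
        """SHA-256 of the server certificate as seen from this vantage point."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                der = tls.getpeercert(binary_form=True)
        return hashlib.sha256(der).hexdigest()

    if __name__ == "__main__":
        # Each volunteer reports this value; differing values for the same host
        # at (roughly) the same time are a signal worth investigating.
        print(cert_fingerprint("en.wikipedia.org"))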
State actors can easily bypass the stuff checkusers can see.
We discourage checkusers from going on fishing expeditions, which would turn up this stuff.
Are there multiple editors who edited this article from the same IP? Bulk tools are in some sense more private: you don't reveal as much per user.
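A sketch of what a "bulk" tool in that spirit could look like (purely illustrative, not an existing CheckUser feature): the query answers an aggregate yes/no question about an article's editors without handing the operator per-user IPs.

    from collections import defaultdict
    from typing import Iterable, Tuple

    def editors_share_an_ip(edits: Iterable[Tuple[str, str]]) -> bool:
        """edits is an iterable of (username, ip) pairs for one article.

        Returns only the aggregate answer, never the IPs or the usernames
        involved -- which is what makes the bulk form more private.
        """
        users_by_ip = defaultdict(set)
        for user, ip in edits:
            users_by_ip[ip].add(user)
        return any(len(users) > 1 for users in users_by_ip.values())

    # Documentation-range IPs, for illustration only.
    print(editors_share_an_ip([("Alice", "192.0.2.7"), ("Bob", "192.0.2.7")]))     # True
    print(editors_share_an_ip([("Alice", "192.0.2.7"), ("Bob", "198.51.100.3")]))  # False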
The site can be attacked by biasing articles on networks you control. One countermeasure: parse the article and embed a hash in a comment, then run browser Greasemonkey scripts to check it. But there are false positives from ad injection.
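A sketch of the embed-a-hash-and-check idea; the comment marker and hashing convention are invented for illustration, not an existing MediaWiki feature. Anything that rewrites the page body after the hash is computed (a censoring middlebox, or an ad injector) breaks the check, which is both the point and the source of the false positives.

    import hashlib
    import re

    MARK = re.compile(r"<!--\s*content-hash:([0-9a-f]{64})\s*-->")

    def body_hash(html: str) -> str:
        """Hash of the page with the marker comment (if any) stripped out."""
        return hashlib.sha256(MARK.sub("", html).encode("utf-8")).hexdigest()

    def embed_hash(html: str) -> str:
        """Server side: append a hash of the article body as an HTML comment."""
        return html + f"<!-- content-hash:{body_hash(html)} -->"

    def check_hash(html: str) -> bool:
        """Client side (e.g. a Greasemonkey script): recompute and compare."""
        m = MARK.search(html)
        return bool(m) and m.group(1) == body_hash(html)

    page = embed_hash("<p>Some article text.</p>")
    assert check_hash(page)
    assert not check_hash(page.replace("article", "altered"))   # tampered in transit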
Wikimedia-l should be pushing for public policy, e.g. against public propaganda directed at a government's own citizens. There is no law preventing a government from targeting WP with edits. We use propaganda outside the US.
A troll army tries first to bias, then to destroy: make editors give up. Editors' identities are more or less public. Harass anyone who edits any article, from so many different identities that it doesn't look like a single person. And any supporters are attacked.
We avoid this right now because people voluntarily decide to stay away from Israel/Palestine. Success gets measured by bad metrics, e.g. just whether a page is blanked often.
Only ten edit patrollers. Only a thousand editors. Very vulnerable to targeted harassment.
You can't drive off the paid trolls; they are not emotionally invested. They can get good editors banned from the site by pushing emotional buttons. You just have to up your capacity.
Automation.