[OWW-Discuss] Fwd: Proposal to 'Wikify' GenBank Meets Stiff Resistance
Bill Flanagan
wjf42 at MIT.EDU
Tue Mar 25 11:16:43 EDT 2008
I'm excited by the discussion. Let me share a few observations.
One of the real problems we're dealing with at the moment is that
'wikifying' is, in theory, a great way to stimulate discussion and
collaboration, but in practice we've found that finding specific information
within MediaWiki can be daunting for many users. One suggestion for documents
generated by a wiki has been, "Expose 'em all, an' let Google sort 'em out."
Brute-force searching may not be the best way, or at least not the only way,
to make navigating through wiki data useful. If this data set is to have
"legs" and grow, consider the existing wiki paradigm and how it applies to
this effort.
If a dataset this large is to be put into a form that many contributors can
work on, I'd strongly suggest a few things:
1. Before settling on a new storage structure such as SVN for creating and
managing such a massive dataset, see if MW can be used. In theory, it's more
than capable of handling 2 million discrete entries, the current size of the
Wikipedia dataset. I have no idea how many revisions there are, but the number
must be well in excess of 10 million. Add to this a talk page per document and
you have a LOT of data.
I know you folks can point to millions of lines of source code under
management. But repurposing SVN for a key role in a new system will require
changes. For instance, SVN out of the box does nothing to directly enforce
submission standards on entries in the system. Sure, we can specify filters or
other gatekeepers to do it, but that kind of enforcement is what MediaWiki is,
at its core. The travesty that templates in the system have become
notwithstanding, the core reliably handles a lot of documents. Don't
underestimate the kind of problems the Wikimedia team has overcome to get
where they are now.
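To make the "filters or other gatekeepers" idea concrete: with SVN, that
enforcement would have to live in something like a pre-commit hook. Here's a
minimal sketch in Python, assuming entries are committed as flat GenBank-style
files with a .gb extension and a few required header fields; the file layout
and field names are my own illustration, not anything GenBank or SVN
prescribes.

    #!/usr/bin/env python
    # Sketch of an SVN pre-commit "gatekeeper" for a hypothetical repository layout.
    import subprocess
    import sys

    REQUIRED_FIELDS = ("LOCUS", "DEFINITION", "ORGANISM")  # assumed minimum header set

    def svnlook(subcmd, repos, txn, *args):
        # Inspect the in-flight transaction with svnlook.
        cmd = ["svnlook", subcmd, "-t", txn, repos] + list(args)
        return subprocess.check_output(cmd).decode("utf-8", "replace")

    def main():
        repos, txn = sys.argv[1], sys.argv[2]
        problems = []
        for line in svnlook("changed", repos, txn).splitlines():
            status, path = line[:4].strip(), line[4:]
            if status in ("A", "U") and path.endswith(".gb"):
                entry = svnlook("cat", repos, txn, path)
                missing = [f for f in REQUIRED_FIELDS if f not in entry]
                if missing:
                    problems.append("%s: missing %s" % (path, ", ".join(missing)))
        if problems:
            sys.stderr.write("\n".join(problems) + "\n")
            return 1  # non-zero exit rejects the commit
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Installed as hooks/pre-commit, that gives SVN roughly what MediaWiki provides
out of the box, at the cost of writing and maintaining the gatekeeper yourself.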
2. Have a clear delineation between pages created within the system that
directly reflect a specific entry in one of the biological databases and
those created on a more ad hoc basis by users of such a system. Our
experience has shown that docs defining a specific protocol are hard to sort
out from those merely referencing such protocols; we're dealing with this
concretely right now. Full-text indexing doesn't currently help with it,
categories are too loosely defined, and we don't really have a good taxonomy,
never mind an ontology! We're moving toward relevance-ranked search (Lucene),
but even that won't pull apart all of the information we think our users come
to us looking for.
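One crude but workable way to keep the two kinds of pages machine-separable is
to require every canonical database page to carry a marker template with a
stable accession, and treat everything else as ad hoc. A small sketch, with the
template name and fields invented for illustration:

    # Sketch: tell canonical "record" pages apart from ad-hoc user pages by a
    # required marker template. {{GenBankRecord|...}} is an invented name.
    import re

    CANONICAL_MARKER = re.compile(
        r"\{\{\s*GenBankRecord\s*\|[^}]*\baccession\s*=\s*([\w.]+)", re.IGNORECASE)

    def classify(wikitext):
        """Return ('record', accession) for a canonical entry, ('adhoc', None) otherwise."""
        match = CANONICAL_MARKER.search(wikitext)
        if match:
            return "record", match.group(1)
        return "adhoc", None

    # classify("{{GenBankRecord|accession=NC_000913|organism=Escherichia coli}} ...")
    #   -> ('record', 'NC_000913')
    # classify("Here is my protocol that happens to mention NC_000913 ...")
    #   -> ('adhoc', None)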
3. Explore using the Lucene search engine rather than the default MediaWiki
search. The number of documents in this system would greatly exceed the
capacity of MySQL's noble but, at this scale, nearly useless full-text search
that MW uses by default. Estimates are that Lucene can provide an order of
magnitude performance increase over the MySQL search. It can be located on a
separate server, and one instance can support multiple clustered servers.
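The architectural point, in sketch form: the full-text index lives in its own
daemon that every web frontend in the cluster queries, instead of leaning on
MySQL. The endpoint, parameters, and response format below are placeholders,
not the actual lucene-search protocol.

    # Sketch of a wiki frontend querying a shared, standalone search daemon.
    # Host, port, path, and the tab-separated response format are assumptions.
    import urllib.parse
    import urllib.request

    SEARCH_DAEMON = "http://search-backend.example.org:8123/search"

    def search(query, limit=20):
        """Ask the shared index server (not MySQL) for ranked (score, title) pairs."""
        url = "%s?%s" % (SEARCH_DAEMON,
                         urllib.parse.urlencode({"q": query, "limit": limit}))
        with urllib.request.urlopen(url, timeout=5) as resp:
            lines = resp.read().decode("utf-8").splitlines()
        return [tuple(line.split("\t", 1)) for line in lines if "\t" in line]

    # Every web server in the cluster calls search() against the same daemon,
    # so the index is built and updated once and served to all of them.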
4. Programmatically enforce how the taxonomy of the information, as well as
other metadata, is managed. My experience with OWW indicates that the value of
the content will be greatly enhanced if the community that forms around the
core information uses tools that, over time, drive the creation of data that
is both computer-processable and user-friendly (knowledge, dare I say?).
Otherwise, this effort may find hard limits on the utility of the data
collection.
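What programmatic enforcement could look like in the small: a save-time filter
or bot that checks a record page's metadata against a declared schema and
files a report, rather than silently accepting free-form text. The schema and
field names here are made up for illustration.

    # Sketch: schema-checked metadata for record pages. Field names are invented.
    import re

    # Declared "shape" for one class of page: field name -> validation pattern.
    RECORD_SCHEMA = {
        "accession": re.compile(r"^[A-Z]{1,2}_?\d+(\.\d+)?$"),
        "organism":  re.compile(r"^\S.+$"),
        "taxonomy":  re.compile(r"^[^;]+(;\s*[^;]+)*$"),  # semicolon-separated lineage
    }

    def parse_fields(wikitext):
        """Pull field=value pairs out of the (hypothetical) record template."""
        fields = {}
        for name, value in re.findall(r"\|\s*(\w+)\s*=\s*([^|}\n]*)", wikitext):
            fields[name.lower()] = value.strip()
        return fields

    def validate(wikitext):
        """Return a list of problems; an empty list means the page passes."""
        fields = parse_fields(wikitext)
        problems = []
        for name, pattern in RECORD_SCHEMA.items():
            value = fields.get(name, "")
            if not pattern.match(value):
                problems.append("field '%s' missing or malformed: %r" % (name, value))
        return problems

    # validate("{{GenBankRecord|accession=NC_000913|organism=Escherichia coli"
    #          "|taxonomy=Bacteria; Proteobacteria}}") -> []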
5. This may be an appropriate use for something like the Semantic MediaWiki
extensions. If the 'shape' of a certain class of data can be maintained and
enhanced, the ability to connect these entities into higher-level abstractions
will be dramatically improved. Please don't quote me on this; the jury is
still WAY out on the scalability of Semantic MediaWiki. But some kind of
focused editor for managing data entry requirements is imperative.
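The payoff that maintaining a 'shape' buys, in miniature: once record pages
expose typed properties instead of free prose, higher-level questions reduce
to simple joins across those properties. The records and property names below
are invented for illustration and are not SMW syntax.

    # Sketch of the "higher-level abstraction" payoff: typed properties make
    # cross-record queries easy. Records and values are illustrative only.
    from collections import defaultdict

    records = [
        {"page": "NC_000913", "organism": "Escherichia coli K-12"},
        {"page": "U00096",    "organism": "Escherichia coli K-12"},
        {"page": "NC_002745", "organism": "Staphylococcus aureus"},
    ]

    def group_by(prop, items):
        """Group record pages by one typed property -- a join free text can't support."""
        grouped = defaultdict(list)
        for item in items:
            grouped[item[prop]].append(item["page"])
        return dict(grouped)

    # group_by("organism", records)
    #   -> {'Escherichia coli K-12': ['NC_000913', 'U00096'],
    #       'Staphylococcus aureus': ['NC_002745']}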
6. There is currently no other open-source solution, in PHP, Python, Java,
Ruby, Perl, or anything else, driving the search and update volume that
Wikipedia sustains. GenBank and other bioinformatics databases may see greater
volume under much more real-time constraints, but none of them allow hundreds
of millions of people to view, and at least tens of thousands of people to
create and edit, documents simultaneously. In this sense, I wouldn't encourage
adopting Semantic MediaWiki until its developer community has similarly sized
datasets that are actively being used by their communities.
7. Don't assume the MW concept of a wiki is appropriate for every aspect of
managing this dataset. MW can provide the fabric and storage model of the
system, but a more structured set of tools may be the best way to manage parts
of this data. The current model is to add extensions to the core MW
application to do this, and I can't agree that's the best way to proceed. The
MW framework is a terrible platform for building applications; it's more like
writing assembler for a computer designed by kids stuck in a junior high
school detention room than doing contemporary development work. That
discourages innovation by making access to and extension of the dataset the
province of the Initiate rather than of the many people who can program but
lack experience with MediaWiki.
Conclusion
Putting data onto a server is one thing. Making it useful and sustainable is
quite another. I'm 100% behind doing something to open up control of the
annotation and commentary around this kind of data. I'm just not convinced
this exercise will lead to a tool that is genuinely useful for doing great
science unless it's approached as such, rather than as a data management
exercise. I hope this doesn't come off as sour grapes. If you want to do
something right that makes a difference to a lot of researchers, please look
at the scope of the end-to-end problem you're trying to solve.
On Tue, Mar 25, 2008 at 9:57 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
>
> On 25/03/2008, Austin Che <austin at csail.mit.edu> wrote:
>
> >
> > > a) grab all the data from the PDB (for example)
> > > b) stick all the data into a revisioning system
> > > c) allow users to freely edit the data, including automatic clean up
> > > 'bots', algorithms, etc., etc.
> > > d) have all changes automatically emailed to a mailing list for
> > > community review, approval etc.
> > >
> > > Now, once we get to step d, in the time since step a, the PDB data has
> > > been updated by the PDB. We now need to merge the updated PDB data with
> > > our independently modified data. (This is where we need to go beyond a
> > > simple revisioning system).
> >
> >
> > That was supposed to be the goal of wikiproteins
> > http://www.wikiproteins.org/Meta:Wikiproteins
> > Roughly speaking, it does exactly the above. It works with
> > databases like swiss-prot and downloads all the data into a
> > semantic wiki form allowing edits. The changes made by the
> > community can be kept separate from the "authoritative" version
> > and periodically they're sent back to the original source for
> > updating. At least that's the idea. They collaborate with those
> > databases so not all of the databases are resisting this idea.
>
>
>
> Yeah, I like the idea, and it's good to see database developers on board
> with that project. I spent some time hanging out near some wikiproteins
> developers (as close as I could get) in irc://irc.freenode.net/#omegawiki -
> I was disappointed at their lack of communication. For example, I have
> worked on a Swiss-Prot parser, so I was curious as to how they had decided
> to relationally model the data (if at all), but I couldn't get any feedback
> from them. Perhaps I was in the wrong place / asking the wrong
> questions / etc.
>
> Also, some of the concepts that they expound at wikiproteins are really
> confusing, and so far they have not answered my questions about them. For
> example,
>
> http://www.wikiproteins.org/Partner_talk:Knewco
> http://www.wikiproteins.org/Meta_talk:Wikiproteins
>
> http://www.wikiproteins.org/Meta_talk:Personal_Desktop
> http://www.wikiproteins.org/Meta_talk:Alerts
>
>
> I am sure wikiproteins will deliver, I am just not sure when (or what
> really)... Did anyone try the new beta release?
>
>
>
> > --
> > Austin Che <austin at csail.mit.edu> (617)253-5899
>
> --
> hello
>
> _______________________________________________
> OpenWetWare Discussion Mailing List
> discuss at openwetware.org
> http://mailman.mit.edu/mailman/listinfo/oww-discuss
>
>