I'm excited by the discussion. Let me share a few observations. <br><br>One of the real problems we're dealing with at the moment is that 'wikifying', in theory a great way to stimulate discussion and collaboration, runs into practical trouble: we've found that finding specific information within MediaWiki can be daunting for many users. One suggestion for handling documents generated by a wiki has been, <br>
<br><b>"Expose 'em all, an' let Google sort 'em out".<br></b><br>Brute force searching of this data may not be the best, or at least the only, way to make navigation through wiki data useful. If this data set is to have "legs" and grow, consider the existing wiki paradigm and how it applies to this effort. <br>
<br>If a dataset this large is to be put into a format that many contributors can manipulate, I'd strongly suggest a few things:<br><br>1. Before deciding on a new storage structure such as SVN for such a massive dataset, see whether MW can be used. In theory, it's more than capable of handling 2 million discrete entries, the current size of the Wikipedia dataset. I have no idea how many revisions there are, but the number must be in excess of 10 million. Add to this a talk page per document and you have a LOT of data. <br>
<br>I know you folks can point to millions of lines of source code under SVN management. But repurposing SVN to play a key role in a new system will require changes. For instance, there's no way for SVN itself to directly enforce submission standards on entries in the system; sure, we can bolt on filters or other gatekeepers (a hook script, say, as sketched below), but that gatekeeping is what MediaWiki already is, at its core. The travesty that templates have become notwithstanding, the core is reliably handling a lot of documents. Don't underestimate the kind of problems the Wikimedia team has overcome to get where they are now. <br>
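<br>For what it's worth, here is the sort of gatekeeper I mean if you do go the SVN route: a pre-commit hook that rejects entries failing a basic submission check. This is a rough sketch only; the 'ID:' header requirement and the .txt file naming are hypothetical placeholders for whatever standard the community actually settles on, and only the svnlook calls are real SVN machinery.<br><pre>
#!/usr/bin/env python
# Sketch of an SVN pre-commit hook acting as a submission gatekeeper.
# The required "ID:" header and the .txt naming are invented examples.
import subprocess
import sys

REPOS, TXN = sys.argv[1], sys.argv[2]

def svnlook(*args):
    """Run an svnlook subcommand against the in-flight transaction."""
    out = subprocess.run(("svnlook",) + args, stdout=subprocess.PIPE, check=True)
    return out.stdout.decode("utf-8", "replace")

errors = []
for line in svnlook("changed", "-t", TXN, REPOS).splitlines():
    status, path = line.split(None, 1)
    if status[0] in ("A", "U") and path.endswith(".txt"):
        content = svnlook("cat", "-t", TXN, REPOS, path)
        if not content.startswith("ID:"):
            errors.append("%s is missing its ID: header" % path)

if errors:
    sys.stderr.write("\n".join(errors) + "\n")
    sys.exit(1)   # a non-zero exit aborts the commit
sys.exit(0)
</pre>
MediaWiki, by contrast, does this kind of gatekeeping at the application level as a matter of course, which is my point. <br>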
<br>2. Have a clear delineation between pages that directly reflect a specific entry in one of the biological databases and pages created on a more 'ad hoc' basis by users of the system. Our experience has shown that docs defining a specific protocol are hard to sort out from those merely referencing such protocols, and we're wrestling with this concretely right now. Full-text indexing doesn't currently help; categories are too loosely defined; we don't really have a good taxonomy, never mind an ontology! We're moving toward relevance-ranked search (Lucene), but even this won't pull apart all of the information we think our users come looking for. One way to keep the two classes of pages distinguishable from the start is sketched below. <br>
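<br>For instance, and purely as a sketch: if the database-derived record pages were kept in a namespace of their own (the namespace ID 100 and the wiki URL below are assumptions for illustration, not anything OWW or this project has agreed on), a few lines against the standard MediaWiki api.php are enough to audit the two classes of pages separately:<br><pre>
# Minimal sketch: list pages in a dedicated "record" namespace and in the
# main namespace via the standard MediaWiki API. Endpoint and namespace ID
# are placeholders; continuation of long result sets is omitted for brevity.
import json
import urllib.request

API = "https://example.org/w/api.php"   # hypothetical wiki endpoint

def list_pages(namespace_id, limit=500):
    url = (API + "?action=query&list=allpages&format=json"
           "&apnamespace=%d&aplimit=%d" % (namespace_id, limit))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [p["title"] for p in data["query"]["allpages"]]

record_pages = list_pages(100)   # assumed custom namespace for database records
adhoc_pages = list_pages(0)      # the main namespace, where ad-hoc pages live
print("%d database-derived records, %d ad-hoc pages"
      % (len(record_pages), len(adhoc_pages)))
</pre>
The point is not the script; it's that the separation has to exist in the data model before any script, or any search engine, can exploit it. <br>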
<br>3. Explore using the Lucene search engine rather than the default MediaWiki engine. The number of documents in this system would greatly exceed what the MW default, MySQL's noble but ultimately inadequate full-text search, can handle well. Estimates are that Lucene can provide an order-of-magnitude performance increase over the MySQL search. It can also run on a separate server, and a single instance can serve multiple clustered wiki servers; a rough sketch of that split is below. <br>
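<br>To make the split concrete, here is a sketch of the front-end side only. The host, port, and JSON response shape are hypothetical; the real lucene-search daemon used with MediaWiki speaks its own protocol, which I won't vouch for from memory. The architectural point is simply that query load leaves the database server entirely:<br><pre>
# Sketch of a wiki front end delegating search to a Lucene-backed service
# on a separate host. Host name, port, and response format are assumptions.
import json
import urllib.parse
import urllib.request

SEARCH_HOST = "http://search.example.org:8123"   # hypothetical search box

def search(query, limit=20):
    url = "%s/search?%s" % (SEARCH_HOST,
                            urllib.parse.urlencode({"q": query, "limit": limit}))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["hits"]   # assumed: [{"title": ..., "score": ...}]

for hit in search("haemoglobin beta chain"):
    print(hit["score"], hit["title"])
</pre>
One such instance can then serve several clustered wiki servers, as noted above. <br>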
<br>4. Programmatically enforce how the taxonomy of the information, as well as other metadata, is managed. My experience with OWW so far indicates that the value of the content will be greatly enhanced if the community that forms around the core information uses tools that, over time, drive the creation of data that is both machine-processable and user-friendly (knowledge, dare I say?). Otherwise, this effort may run into limits on the utility of the data collection. A sketch of one such tool, a metadata-auditing bot, follows. <br>
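<br>As one illustration of 'programmatic enforcement' (the wiki URL, page titles, and required category names below are placeholders, and a real bot would need to handle result continuation and authentication before making any edits), an audit script can flag record pages that are missing the agreed metadata:<br><pre>
# Minimal metadata audit sketch: ask api.php which categories each record
# page carries and report any required ones that are missing.
import json
import urllib.parse
import urllib.request

API = "https://example.org/w/api.php"      # hypothetical wiki endpoint
REQUIRED = {"Category:Protein structure", "Category:Source database"}  # invented

def categories_of(title):
    url = API + "?" + urllib.parse.urlencode({
        "action": "query", "prop": "categories", "format": "json",
        "cllimit": "500", "titles": title})
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    page = next(iter(pages.values()))
    return {c["title"] for c in page.get("categories", [])}

for title in ("PDB:1ABC", "PDB:2XYZ"):     # placeholder record pages
    missing = REQUIRED - categories_of(title)
    if missing:
        print("%s is missing: %s" % (title, ", ".join(sorted(missing))))
</pre>
Run nightly, something like this keeps the taxonomy from silently rotting without asking contributors to do anything up front. <br>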
<br>5. This may be an appropriate use for something like the Semantic MediaWiki extensions. If the 'shape' of a certain class of data can be maintained and enhanced, the ability to connect these entities into other, higher-level abstractions will be dramatically improved; the markup below gives the flavour. Please don't quote me on this. The jury is still WAY out on the scalability of Semantic MediaWiki. But some kind of focused editor for managing data-entry requirements is imperative. <br>
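<br>To give the flavour of what that 'shape' looks like in Semantic MediaWiki, here is the sort of markup involved. The property and category names are invented for illustration; they are not an existing vocabulary on OWW or anywhere else:<br><pre>
On a record page, facts are annotated inline as typed properties:

  '''1ABC''' is a structure of [[Is structure of::Haemoglobin subunit beta]],
  solved at [[Has resolution::1.8 Angstrom]] in [[Has organism::Homo sapiens]].

Elsewhere, an inline query pulls those annotations back together:

  {{#ask: [[Category:Structure]] [[Has organism::Homo sapiens]]
   | ?Is structure of
   | ?Has resolution
  }}
</pre>
Whether that scales to millions of record pages is exactly the open question. <br>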
<br>6. There is currently no other open-source solution, in PHP, Python, Java, Ruby, Perl, or whatever else, that is driving the search and update volume Wikipedia sustains. GenBank and other bioinformatics databases may see greater volume under much tighter real-time constraints, but none of them lets hundreds of millions of people view, and at least tens of thousands of people create and edit, documents simultaneously. In that sense, I wouldn't encourage adopting Semantic MediaWiki until its developer community can point to similarly sized datasets being actively used by their communities. <br>
<br>7. Don't assume the MW concept of a wiki is appropriate for every aspect of managing this dataset. MW can provide the fabric and storage model of the system, but a more structured set of tools may be the best way to manage parts of this data. Currently, the accepted model is to add extensions to the core MW application to do this, and I can't agree that's the best way to proceed. The MW framework is a terrible platform for designing programs; it feels more like writing assembler for a computer designed by kids stuck in a junior-high detention room than like contemporary development work. This discourages innovation by making access to, and extension of, the dataset the province of the Initiate rather than of the many people who can program but lack MediaWiki experience. External tools that talk to the wiki through its ordinary web API, as in the sketch below, are one way to keep the door open to them. <br>
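<br>As a reminder of how low that barrier could be kept (the wiki URL and page title below are placeholders), pulling the current wikitext of any page needs nothing more exotic than the standard api.php; no extension writing, and no MediaWiki internals, required:<br><pre>
# Sketch: fetch the current wikitext of a page through the standard
# MediaWiki api.php so outside tools can work with the data directly.
import json
import urllib.parse
import urllib.request

API = "https://example.org/w/api.php"      # hypothetical wiki endpoint

def fetch_wikitext(title):
    url = API + "?" + urllib.parse.urlencode({
        "action": "query", "prop": "revisions", "rvprop": "content",
        "format": "json", "titles": title})
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]        # classic api.php keys content as "*"

print(fetch_wikitext("PDB:1ABC")[:200])     # placeholder record page
</pre>
Anyone who can make an HTTP request can build on the dataset; that's the audience worth designing for. <br>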
<br>Conclusion<br><br>Putting data onto a server is one thing; making it useful and sustainable is quite another. I'm 100% behind opening up control of the annotation and commentary around this kind of data. I'm just not convinced this exercise will produce a useful tool for doing great science unless it's approached as exactly that, and not as a data-management exercise. I hope this doesn't come off as sour grapes. If you want to do something right that makes a difference to a lot of researchers, please look at the scope of the end-to-end problem you're trying to solve. <br>
<br><br><div class="gmail_quote">On Tue, Mar 25, 2008 at 9:57 AM, Dan Bolser <<a href="mailto:dan.bolser@gmail.com" target="_blank">dan.bolser@gmail.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><span class="gmail_quote"></span><br><div><div><span><div><span class="gmail_quote">On 25/03/2008, <b class="gmail_sendername">Austin Che</b> <<a href="mailto:austin@csail.mit.edu" target="_blank">austin@csail.mit.edu</a>> wrote:</span></div>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>
<br> > a) grab all the data from the PDB (for example)<br> > b) stick all the data into a revisioning system<br> > c) allow users to freely edit the data, including automatic clean up 'bots',<br> > algorithms, etc., etc.<br>
> d) have all changes automatically emailed to a mailing list for community<br> > review, approval etc.<br> ><br> > Now, once we get to step d, in the time since step a, the PDB data has been<br> > updated by the PDB. We now need to merge the updated PDB data with our<br>
> independently modified data. (This is where we need to go beyond a simple<br> > revisioning system).<br> <br> <br></div> That was supposed to be the goal of wikiproteins<br> <a href="http://www.wikiproteins.org/Meta:Wikiproteins" target="_blank">http://www.wikiproteins.org/Meta:Wikiproteins</a><br>
Roughly speaking, it does exactly the above. It works with<br> databases like swiss-prot and downloads all the data into a<br> semantic wiki form allowing edits. The changes made by the<br> community can be kept separate from the "authoritative" version<br>
and periodically it's sent back to the original source for<br> updating. At least that's the idea. They collaborate with those<br> databases so not all of the databases are resisting this idea.</blockquote>
</span></div><div><br><br>Yeah, I like the idea, and it's good to see database developers on board with that project. I spent some time hanging out near some wikiproteins developers (as close as I could get) in irc://irc.freenode.net/#omegawiki - I was disappointed at their lack of communication. For example, I have worked on a Swiss-Prot parser, so I was curious as to how they had decided to relationally model the data (if at all), but I couldn't get any feedback from them. Perhaps I was in the wrong place / asking the wrong questions / etc.<br>
<br>Also some of the concepts that they expound at wikiproteins are really confusing, and so far they have not decided to answer my questions about them. For example, <br><br><a href="http://www.wikiproteins.org/Partner_talk:Knewco" target="_blank">http://www.wikiproteins.org/Partner_talk:Knewco</a><br>
<a href="http://www.wikiproteins.org/Meta_talk:Wikiproteins" target="_blank">http://www.wikiproteins.org/Meta_talk:Wikiproteins</a><br><br><a href="http://www.wikiproteins.org/Meta_talk:Personal_Desktop" target="_blank">http://www.wikiproteins.org/Meta_talk:Personal_Desktop</a><br>
<a href="http://www.wikiproteins.org/Meta_talk:Alerts" target="_blank">http://www.wikiproteins.org/Meta_talk:Alerts</a><br><br><br>I am sure wikiproteins will deliver, I am just not sure when (or what really)... Did anyone try the new beta release?<br>
<br>
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> --<div><span><br> <span name="st">Austin</span> Che <<a href="mailto:austin@csail.mit.edu" target="_blank"><span name="st">austin</span>@csail.mit.edu</a>> (617)253-5899<br>
</span></div></blockquote></div><span><br><br clear="all"><br>-- <br>hello</span><br clear="all">
<br></blockquote></div><br>