<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML dir=ltr><HEAD><TITLE>Large-scale DSpace repositories (was Re: [Dspace-general] Another committer)</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.6000.16414" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff><FONT face=Arial color=#008080 size=2>
<DIV>Hi Stephane, James, and all other DSpace users interested in archival
storage, </DIV>
<DIV> </DIV>
<DIV>I have been involved in a number of medical projects involving SANs
and other storage solutions, and I see a major storage virtualization
movement. Ideally, a repository like DSpace should have full control over
the data storage/migration and location process. This currently takes place
in the IT environment via SAN/hierarchical storage managers or more advanced
(grid) storage broker systems like SRB or HP MAS. The archival
storage functionality is described well in the OAIS model. A SAN
is a good solution for starting a repository service; it is, however, not
archival storage as described in the OAIS model, and it is too expensive for
large multimedia collections. I see a market trend towards deep archival storage
outsourced to datacenters, which have petabyte libraries, while active content is
kept (highly available) on local SAN/NAS solution(s). Ideally, the institution should
have a three-copy policy using two datacenters under different legal entities. In the
Netherlands we are looking into cooperation with the Dutch Sound & Vision
Foundation, which maintains the national broadcast archive (1.5
PB). The big issue is the connection at the storage management
level: e.g., different vendor solutions, defining and setting policies,
maintaining integrity, and getting all operational (IT) info back at a
higher management reporting level. </DIV>
<DIV> </DIV>
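<DIV>As an illustration only (my own sketch with hypothetical names, not
existing DSpace or broker code), such a copy policy could be checked
mechanically: given the locations of an object's copies, verify that there
are at least three copies, spread over at least two datacenters belonging to
different legal entities.</DIV>
<DIV> </DIV>
<PRE>
// Hypothetical sketch (not existing DSpace or SRB code): check a
// "three copies, two datacenters, different legal entities" policy.
import java.util.HashSet;
import java.util.Set;

public class CopyPolicyCheck {

    // One stored copy of an object: where it lives and who legally holds it.
    public static class Copy {
        final String datacenter;
        final String legalEntity;
        Copy(String datacenter, String legalEntity) {
            this.datacenter = datacenter;
            this.legalEntity = legalEntity;
        }
    }

    // True when there are at least 3 copies, spread over at least 2
    // datacenters held by at least 2 different legal entities.
    public static boolean satisfiesPolicy(Copy[] copies) {
        Set datacenters = new HashSet();  // distinct datacenter names
        Set entities = new HashSet();     // distinct legal entity names
        for (Copy c : copies) {
            datacenters.add(c.datacenter);
            entities.add(c.legalEntity);
        }
        return copies.length >= 3
                && datacenters.size() >= 2
                && entities.size() >= 2;
    }
}
</PRE>
<DIV> </DIV>
<DIV>A repository or reporting layer could run such a check periodically over
all persistent IDs and report violations upward, which is exactly the kind of
operational info I would like back at the management reporting level.</DIV>
<DIV> </DIV>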
<DIV>To support managing this heterogeneous environment at Erasmus
Medical Center, we are trying to create a sort of standard file-based
middle layer (see the enclosed picture). It comprises a content file,
a metadata XML file, and a policy XML file, which are all linked via the same
persistent ID filename. We need a very flexible system that can combine three
standardisation worlds for institutional repositories: the
regulatory/compliance/preservation world, the domain knowledge of a community, and
the IT world. All use their own standards, and trying to mix them into one big
thing will, I think, not work.</DIV>
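<DIV> </DIV>
<DIV>To make the middle layer concrete, here is a minimal sketch (my own; the
filename suffixes and the MD5 fixity check are assumptions, not an existing
DSpace API) of how the three sibling files could be resolved from one
persistent ID, with an integrity check that a storage broker could run
independently of the repository:</DIV>
<DIV> </DIV>
<PRE>
// Hypothetical sketch of the file-based middle layer: every object is
// three sibling files sharing one persistent ID as filename stem:
//
//   STORE/PID.content       the bitstream itself
//   STORE/PID.metadata.xml  descriptive metadata
//   STORE/PID.policy.xml    preservation/access policy
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class MiddleLayerObject {

    private final Path store;          // root of the storage area
    private final String persistentId; // shared filename stem

    public MiddleLayerObject(Path store, String persistentId) {
        this.store = store;
        this.persistentId = persistentId;
    }

    public Path contentFile()  { return store.resolve(persistentId + ".content"); }
    public Path metadataFile() { return store.resolve(persistentId + ".metadata.xml"); }
    public Path policyFile()   { return store.resolve(persistentId + ".policy.xml"); }

    // Fixity check: recompute the MD5 of the content file, so integrity
    // can be verified outside the repository application.
    public String contentChecksum() throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = Files.newInputStream(contentFile());
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
</PRE>
<DIV> </DIV>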
<DIV>I like the OAIS reference model, but I am not sure how best to organize
its functions into software modules. I am focusing on the Archival Storage
function now. It should work with a wide range of (commercial) storage
management and IT/datacenter solutions; for many of these solutions, data
integrity and unique persistent link management are new. I believe we could
design a more or less general preservation/admin management function
(preferably in DSpace). The big question is: management of what? Content
files, items, collections, etc.? I feel this is related to access/ownership
and is very community dependent. We are now focusing on first getting the
content file management part arranged well, and will worry later about
managing the higher-level relationships. Interestingly, the Dutch Sound &
Vision Foundation has its own catalogue development that links user access
to DRM and its content archive. They said it is basically a simple rule-based
matrix, very similar to the rules/policies Reagan Moore is working on with
SRB on the storage management side. My question is: shouldn't DSpace be
developed in a similar way as well? E.g., a very simple/"stupid" archive
content/item system based on storage brokers that are policy-driven from a
copy/information-lifecycle-management perspective, plus a very flexible
access layer that can be tuned to community needs?</DIV>
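<DIV> </DIV>
<DIV>For illustration, here is a minimal sketch of such a rule-based matrix
(my own assumption of what it could look like; this is not Sound & Vision's
catalogue or SRB code): each row maps a user role, an action, and a content
class to an allow/deny decision, and the storage brokers never need to see
it.</DIV>
<DIV> </DIV>
<PRE>
// Hypothetical rule-based access matrix: the first matching row wins,
// and the default is deny. Roles, actions and content classes are
// example values only.
public class AccessMatrix {

    // Each row: role, action, content class, decision.
    private static final String[][] RULES = {
        { "public",    "view",     "open",       "ALLOW" },
        { "public",    "download", "restricted", "DENY"  },
        { "staff",     "download", "restricted", "ALLOW" },
        { "archivist", "migrate",  "any",        "ALLOW" },
    };

    public static boolean isAllowed(String role, String action, String contentClass) {
        for (String[] rule : RULES) {
            boolean classMatches = rule[2].equals("any") || rule[2].equals(contentClass);
            if (rule[0].equals(role) && rule[1].equals(action) && classMatches) {
                return rule[3].equals("ALLOW");
            }
        }
        return false; // default deny
    }
}
</PRE>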
<DIV> </DIV>
<DIV>I look forward to everybody's comments on these ideas!</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>Peter Walgemoed<BR><BR></DIV></FONT>
<DIV><FONT face=Arial color=#008080 size=2></FONT> </DIV>
<DIV><FONT face=Arial color=#008080 size=2></FONT> </DIV>
<BLOCKQUOTE
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #008080 2px solid; MARGIN-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=stephane.tellier@cgi.com
href="mailto:stephane.tellier@cgi.com">Tellier, Stephane</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A title=james.rutherford@hp.com
href="mailto:james.rutherford@hp.com">James Rutherford</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Cc:</B> <A title=dspace-general@mit.edu
href="mailto:dspace-general@mit.edu">dspace-general@mit.edu</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Thursday, March 01, 2007 3:01
PM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> Re: [Dspace-general] Large-scale
DSpace repositories (was Re:Another committer)</DIV>
<DIV><BR></DIV>
<DIV id=idOWAReplyText57824 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>Thanks Jim for your
answers.</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT size=2>> I haven't been involved with the hardware
specification for the data<BR>> centres that will be operating, but I could
probably get some<BR>> information </FONT></DIV>
<DIV dir=ltr><FONT size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Any kind of information would be
great.</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>In our case, we will surely use Oracle
for our database, because our client (the Library and National Archives of
Quebec) has an important contract with them.</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>We already have a SAN server with a
lot of disk space, and we intend to build an architecture with at least 4 or 5
servers (see the picture attached). The objective is to isolate the
web applications from the import/export and indexing jobs. We also propose in
this architecture to install the Handle server on a separate machine. Note that
this solution doesn't include any kind of clustering and/or load balancing,
since we think that is something we should do only after reaching a certain
number of items; at the beginning, we will probably not have more than
50,000 items, essentially images, though some of them are very big, like
geographical charts (one GIF can be in the range of 500 MB).</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Our digital collections are
essentially collections of JPEG or GIF images, fully indexable PDFs (5,000 to
10,000), some collections of audio files (not that many), some collections of
video files (not many for now), and periodicals consisting of many PDFs which
are not indexable, since those are "images" of text and, for now, none of them
has been processed with an OCR tool (that will come eventually). Periodicals are
the collections that could raise our count to a million items. Note that
in the periodicals case, we intend to find a solution that will keep
us from repeating the metadata about a periodical across all the items that will
contain the PDF files, obviously for performance reasons (see the email in
dspace-general about preserving structured collections:</FONT></DIV>
<DIV dir=ltr><A
href="http://mailman.mit.edu/pipermail/dspace-general/2007-February/001369.html">http://mailman.mit.edu/pipermail/dspace-general/2007-February/001369.html</A> )</DIV>
<DIV dir=ltr> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Our team will certainly be very
interested in DSpace clustering.</FONT></DIV>
<DIV dir=ltr> </DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> James Rutherford
[mailto:james.rutherford@hp.com]<BR><B>Sent:</B> Thu 01/03/2007 6:49
AM<BR><B>To:</B> Tellier, Stephane<BR><B>Cc:</B>
Dspace-general@mit.edu<BR><B>Subject:</B> Large-scale DSpace repositories (was
Re: [Dspace-general] Another committer)<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Hi Stephane,<BR><BR>On Thu, Feb 22, 2007 at 08:19:52AM -0500,
Tellier, Stephane wrote:<BR>> Since you seem to have worked on the China
Digital Museum project, I<BR>> was wondering if it could be possible for
you to give some<BR>> informations about the hardware specs and the
hardware architecture<BR>> (SAN server, load balancing, multiple dspace
instances, etc.) about<BR>> that project. If you could send some
documentations about it, or refer<BR>> to a web site or wiki explaining
these aspects, that would be very<BR>> great.<BR><BR>I haven't been
involved with the hardware specification for the data<BR>centres that will be
operating, but I could probably get some<BR>information (the estimate is that
they will eventually hold ~200TiB of<BR>content each). As for multiple
instances, load balancing, etc, Graham Triggs<BR>and I are looking into
clustering mechanisms for DSpace,<BR>both for the database and for the servlet
container. If you would like<BR>to contribute to this effort, or read up on
what we have found so far, I<BR>suggest you review this page:<BR><BR><A
href="http://wiki.dspace.org/HOWTO_Clustering">http://wiki.dspace.org/HOWTO_Clustering</A><BR><BR>This
page is very much a work in progress; none of the proposed<BR>mechanisms of
clustering on that page have been successful yet (though<BR>we are still
working on it). For your project, it may be worth<BR>purchasing clustering
services from someone like Oracle (I've not listed<BR>that as an option
because I wanted to provide information on what can be<BR>done for
free).<BR><BR>> Actually in our team, we're trying to implement a DSpace
solution for<BR>> a library and we could expect to have needs for a very
large number of<BR>> digital documents (over a million could be a
possibility), and we are<BR>> asking ourselves what kind of servers and
architecture should we used<BR>> for that range.<BR><BR>This is not an easy
question to answer, which is presumably why someone<BR>is paying you to answer
it ;) Without knowing more detail about the<BR>typical document type, size,
etc, it would be difficult to give any<BR>advice on this. That said, no-one is
running a DSpace repository with<BR>more than ~200,000 items, so predicting
performance and coming up with<BR>an architecture for repositories with
>1,000,000 documents is naturally<BR>rather
difficult.<BR><BR>cheers,<BR><BR>Jim<BR></FONT></P></DIV>
<P>
<HR>
<P></P>_______________________________________________<BR>Dspace-general
mailing
list<BR>Dspace-general@mit.edu<BR>http://mailman.mit.edu/mailman/listinfo/dspace-general<BR></BLOCKQUOTE></BODY></HTML>