<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML dir=ltr><HEAD><TITLE>Large-scale DSpace repositories (was Re: [Dspace-general] Another committer)</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.6000.16414" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff><FONT face=Arial color=#008080 size=2>
<DIV>Hi Stephane, James, and all other DSpace users interested in archival
storage, </DIV>
<DIV> </DIV>
<DIV>I have been involved in a number of medical projects involving SANs
and other storage solutions, and I see a major storage virtualization
movement. Ideally, a repository like DSpace should have full control over
the data storage/migration and location process. This currently takes place
in the IT environment via SAN/hierarchical storage managers or more advanced
(grid) storage broker systems like SRB or HP MAS. The archival
storage functionality is described well in the OAIS model. A SAN
is a good solution for starting a repository service; it is, however, not
archival storage as described in the OAIS model, and it is too expensive for
large multimedia collections. I see a market trend towards deep archival storage
outsourced to datacenters, which have petabyte libraries, while active content is
kept (highly available) on local SAN/NAS solution(s). Ideally, the institution should
have a three-copy policy using two datacenters under different legal entities. In the
Netherlands we are looking into cooperation with the Dutch Sound & Vision
Foundation, which maintains the national broadcast archive (1.5
PB). The big issue is the connection at the storage management
level: e.g., different vendor solutions, defining and setting policies,
maintaining integrity, and getting all operational (IT) info back at a
higher management reporting level. </DIV>
<DIV> </DIV>
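<DIV>As an illustration only (my own sketch with hypothetical names, not
existing DSpace or broker code), such a copy policy could be checked
mechanically: given the locations of an object's copies, verify that there
are at least three copies, spread over at least two datacenters belonging to
different legal entities.</DIV>
<DIV> </DIV>
<PRE>
// Hypothetical sketch (not existing DSpace or SRB code): check a
// "three copies, two datacenters, different legal entities" policy.
import java.util.HashSet;
import java.util.Set;

public class CopyPolicyCheck {

    // One stored copy of an object: where it lives and who legally holds it.
    public static class Copy {
        final String datacenter;
        final String legalEntity;
        Copy(String datacenter, String legalEntity) {
            this.datacenter = datacenter;
            this.legalEntity = legalEntity;
        }
    }

    // True when there are at least 3 copies, spread over at least 2
    // datacenters held by at least 2 different legal entities.
    public static boolean satisfiesPolicy(Copy[] copies) {
        Set datacenters = new HashSet();  // distinct datacenter names
        Set entities = new HashSet();     // distinct legal entity names
        for (Copy c : copies) {
            datacenters.add(c.datacenter);
            entities.add(c.legalEntity);
        }
        return copies.length >= 3
                && datacenters.size() >= 2
                && entities.size() >= 2;
    }
}
</PRE>
<DIV> </DIV>
<DIV>A repository or reporting layer could run such a check periodically over
all persistent IDs and report violations upward, which is exactly the kind of
operational info I would like back at the management reporting level.</DIV>
<DIV> </DIV>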
<DIV>To support managing this heterogeneous environment at Erasmus
Medical Center, we are trying to create a sort of standard file-based
middle layer (see the enclosed picture). It comprises a content file,
a metadata XML file, and a policy XML file, which are all linked via the same
persistent ID filename. We need a very flexible system that can combine three
standardisation worlds for institutional repositories: the
regulatory/compliance/preservation world, the domain knowledge of a community, and
the IT world. All use their own standards, and trying to mix them into one big
thing will, I think, not work.</DIV>
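<DIV> </DIV>
<DIV>To make the middle layer concrete, here is a minimal sketch (my own; the
filename suffixes and the MD5 fixity check are assumptions, not an existing
DSpace API) of how the three sibling files could be resolved from one
persistent ID, with an integrity check that a storage broker could run
independently of the repository:</DIV>
<DIV> </DIV>
<PRE>
// Hypothetical sketch of the file-based middle layer: every object is
// three sibling files sharing one persistent ID as filename stem:
//
//   STORE/PID.content       the bitstream itself
//   STORE/PID.metadata.xml  descriptive metadata
//   STORE/PID.policy.xml    preservation/access policy
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class MiddleLayerObject {

    private final Path store;          // root of the storage area
    private final String persistentId; // shared filename stem

    public MiddleLayerObject(Path store, String persistentId) {
        this.store = store;
        this.persistentId = persistentId;
    }

    public Path contentFile()  { return store.resolve(persistentId + ".content"); }
    public Path metadataFile() { return store.resolve(persistentId + ".metadata.xml"); }
    public Path policyFile()   { return store.resolve(persistentId + ".policy.xml"); }

    // Fixity check: recompute the MD5 of the content file, so integrity
    // can be verified outside the repository application.
    public String contentChecksum() throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = Files.newInputStream(contentFile());
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
</PRE>
<DIV> </DIV>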
<DIV>I like the OAIS reference model, but I am not sure how best to organize
its functions into software modules. I am focusing on the Archival Storage
function now. It should work with a wide range of (commercial) storage
management and IT/datacenter solutions; for many of these solutions, data
integrity and unique persistent link management are new. I believe we could
design a more or less general preservation/admin management function
(preferably in DSpace). The big question is: management of what? Content
files, items, collections, etc.? I feel this is related to access/ownership
and is very community dependent. We are now focusing on first getting the
content file management part arranged well, and will worry later about
managing the higher-level relationships. Interestingly, the Dutch Sound &
Vision Foundation has its own catalogue development that links user access
to DRM and its content archive. They said it is basically a simple rule-based
matrix, very similar to the rules/policies Reagan Moore is working on with
SRB on the storage management side. My question is: shouldn't DSpace be
developed in a similar way as well? E.g., a very simple/"stupid" archive
content/item system based on storage brokers that are policy-driven from a
copy/information-lifecycle-management perspective, plus a very flexible
access layer that can be tuned to community needs?</DIV>
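<DIV> </DIV>
<DIV>For illustration, here is a minimal sketch of such a rule-based matrix
(my own assumption of what it could look like; this is not Sound & Vision's
catalogue or SRB code): each row maps a user role, an action, and a content
class to an allow/deny decision, and the storage brokers never need to see
it.</DIV>
<DIV> </DIV>
<PRE>
// Hypothetical rule-based access matrix: the first matching row wins,
// and the default is deny. Roles, actions and content classes are
// example values only.
public class AccessMatrix {

    // Each row: role, action, content class, decision.
    private static final String[][] RULES = {
        { "public",    "view",     "open",       "ALLOW" },
        { "public",    "download", "restricted", "DENY"  },
        { "staff",     "download", "restricted", "ALLOW" },
        { "archivist", "migrate",  "any",        "ALLOW" },
    };

    public static boolean isAllowed(String role, String action, String contentClass) {
        for (String[] rule : RULES) {
            boolean classMatches = rule[2].equals("any") || rule[2].equals(contentClass);
            if (rule[0].equals(role) && rule[1].equals(action) && classMatches) {
                return rule[3].equals("ALLOW");
            }
        }
        return false; // default deny
    }
}
</PRE>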
<DIV> </DIV>
<DIV>I look forward to everybody's comments on these ideas!</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>Peter Walgemoed<BR><BR></DIV></FONT>
<DIV><FONT face=Arial color=#008080 size=2></FONT> </DIV>
<DIV><FONT face=Arial color=#008080 size=2></FONT> </DIV>
<BLOCKQUOTE
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #008080 2px solid; MARGIN-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=stephane.tellier@cgi.com
href="mailto:stephane.tellier@cgi.com">Tellier, Stephane</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A title=james.rutherford@hp.com
href="mailto:james.rutherford@hp.com">James Rutherford</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Cc:</B> <A title=dspace-general@mit.edu
href="mailto:dspace-general@mit.edu">dspace-general@mit.edu</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Thursday, March 01, 2007 3:01
PM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> Re: [Dspace-general] Large-scale
DSpace repositories (was Re:Another committer)</DIV>
<DIV><BR></DIV>
<DIV id=idOWAReplyText57824 dir=ltr>
<DIV dir=ltr><FONT face=Arial color=#000000 size=2>Thanks Jim for your
answers.</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT size=2>> I haven't been involved with the hardware
specification for the data<BR>> centres that will be operating, but I could
probably get some<BR>> information </FONT></DIV>
<DIV dir=ltr><FONT size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Any kind of information would be
great.</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>In our case, we will surely use Oracle
for our database, because our client (the Library and National Archives of
Quebec) has an important contract with them.</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>We already have a SAN server with a
lot of disk space, and we intend to build an architecture with at least 4 or 5
servers (see the picture attached). The objective is to isolate the
web applications from the import/export and indexing jobs. We also propose in
this architecture to install the Handle server on a separate machine. Note that
this solution doesn't include any kind of clustering and/or load balancing,
since we think that is something we should do only after reaching a certain
number of items; at the beginning, we will probably not have more than
50,000 items, essentially images, though some of them are very big, like
geographical charts (one GIF can be in the range of 500 MB).</FONT></DIV>
<DIV dir=ltr><FONT face=Arial size=2></FONT> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Our digital collections are
essentially collections of JPEG or GIF images, fully indexable PDFs (5,000 to
10,000), some collections of audio files (not that many), some collections of
video files (not many for now), and periodicals consisting of many PDFs which
are not indexable, since those are "images" of text and, for now, none of them
has been processed with an OCR tool (that will come eventually). Periodicals are
the collections that could raise our count to a million items. Note that
in the periodicals case, we intend to find a solution that will keep
us from repeating the metadata about a periodical across all the items that will
contain the PDF files, obviously for performance reasons (see the email in
dspace-general about preserving structured collections:</FONT></DIV>
<DIV dir=ltr><A
href="http://mailman.mit.edu/pipermail/dspace-general/2007-February/001369.html">http://mailman.mit.edu/pipermail/dspace-general/2007-February/001369.html</A> )</DIV>
<DIV dir=ltr> </DIV>
<DIV dir=ltr><FONT face=Arial size=2>Our team will certainly be very
interested in DSpace clustering.</FONT></DIV>
<DIV dir=ltr> </DIV></DIV>
<DIV dir=ltr><BR>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> James Rutherford
[mailto:james.rutherford@hp.com]<BR><B>Sent:</B> Thu 01/03/2007 6:49
AM<BR><B>To:</B> Tellier, Stephane<BR><B>Cc:</B>
Dspace-general@mit.edu<BR><B>Subject:</B> Large-scale DSpace repositories (was
Re: [Dspace-general] Another committer)<BR></FONT><BR></DIV>
<DIV>
<P><FONT size=2>Hi Stephane,<BR><BR>On Thu, Feb 22, 2007 at 08:19:52AM -0500,
Tellier, Stephane wrote:<BR>> Since you seem to have worked on the China
Digital Museum project, I<BR>> was wondering if it could be possible for
you to give some<BR>> informations about the hardware specs and the
hardware architecture<BR>> (SAN server, load balancing, multiple dspace
instances, etc.) about<BR>> that project. If you could send some
documentations about it, or refer<BR>> to a web site or wiki explaining
these aspects, that would be very<BR>> great.<BR><BR>I haven't been
involved with the hardware specification for the data<BR>centres that will be
operating, but I could probably get some<BR>information (the estimate is that
they will eventually hold ~200TiB of<BR>content each). As for multiple
instances, load balancing, etc, Graham Triggs<BR>and I are looking into
clustering mechanisms for DSpace,<BR>both for the database and for the servlet
container. If you would like<BR>to contribute to this effort, or read up on
what we have found so far, I<BR>suggest you review this page:<BR><BR><A
href="http://wiki.dspace.org/HOWTO_Clustering">http://wiki.dspace.org/HOWTO_Clustering</A><BR><BR>This
page is very much a work in progress; none of the proposed<BR>mechanisms of
clustering on that page have been successful yet (though<BR>we are still
working on it). For your project, it may be worth<BR>purchasing clustering
services from someone like Oracle (I've not listed<BR>that as an option
because I wanted to provide information on what can be<BR>done for
free).<BR><BR>> Actually in our team, we're trying to implement a DSpace
solution for<BR>> a library and we could expect to have needs for a very
large number of<BR>> digital documents (over a million could be a
possibility), and we are<BR>> asking ourselves what kind of servers and
architecture should we used<BR>> for that range.<BR><BR>This is not an easy
question to answer, which is presumably why someone<BR>is paying you to answer
it ;) Without knowing more detail about the<BR>typical document type, size,
etc, it would be difficult to give any<BR>advice on this. That said, no-one is
running a DSpace repository with<BR>more than ~200,000 items, so predicting
performance and coming up with<BR>an architecture for repositories with
>1,000,000 documents is naturally<BR>rather
difficult.<BR><BR>cheers,<BR><BR>Jim<BR></FONT></P></DIV>
<P>
<HR>
<P></P>_______________________________________________<BR>Dspace-general
mailing
list<BR>Dspace-general@mit.edu<BR>http://mailman.mit.edu/mailman/listinfo/dspace-general<BR></BLOCKQUOTE></BODY></HTML>