[Dspace-general] DSpace large scale and general architecture direction?

Wed Mar 7 02:35:45 EST 2007

Hi Stephane, James, and all other DSpace users interested in archival storage,

I have been involved in a number of medical projects involving SANs and other storage solutions, and I see a major movement towards storage virtualization. Ideally, a repository like DSpace should have full control over the storage, migration and location of its data. Today this is handled in the IT environment by SAN/hierarchical storage managers or by more advanced (Grid) storage broker systems such as SRB or HP MAS. The archival storage functionality is described well in the OAIS model. A SAN is a good way to start a repository service. However, it is not archival storage as described in the OAIS model, and it is too expensive for large multimedia collections.

I see a market trend towards deep archival storage outsourced to datacenters that have petabyte libraries, while active content stays highly available on local SAN/NAS solution(s). Ideally the institution should have a three-copy policy using two datacenters run by different legal entities. In the Netherlands we are looking into cooperation with the Dutch Sound & Vision Foundation, which maintains the national broadcast archive (1.5 PB). The big issue is the connection at the storage management level, e.g. dealing with different vendor solutions, defining and setting policies, maintaining integrity and getting all operational (IT) information back at a higher management reporting level.
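To make the copy policy a bit more concrete, here is a minimal sketch of how such a rule could be checked. It assumes one reading of the policy: at least three copies in total, of which at least two sit in datacenters run by different legal entities. The `Replica` class, its field names and the site names are hypothetical and do not come from DSpace, SRB or any vendor product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One stored copy of a content file (hypothetical model)."""
    location: str        # e.g. "local-san" or "datacenter-a" (made-up names)
    legal_entity: str    # organisation legally responsible for the site
    is_datacenter: bool  # True for an outsourced deep-archive site


def satisfies_three_copy_policy(replicas):
    """Return True when there are at least three copies and the datacenter
    copies are spread over at least two different legal entities."""
    if len(replicas) < 3:
        return False
    datacenter_entities = {r.legal_entity for r in replicas if r.is_datacenter}
    return len(datacenter_entities) >= 2
```

A storage broker could evaluate a rule like this per content file and report the violations upwards, which is the kind of operational information I would like to see at the management reporting level.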

To support managing this heterogeneous environment at Erasmus Medical Center, we are trying to create a sort of standard file-based middle layer (see enclosed pic). It consists of a content file, a metadata XML file and a policy XML file, which are all linked via the same persistent ID filename. We need a very flexible system that can combine three standardisation worlds for institutional repositories: regulatory/compliance/preservation, the domain knowledge of a community, and the IT world. All three use their own standards, and I think trying to mix them into one big thing will not work.

I like the OAIS reference model, but I am not sure how best to organize its functions into software modules. I am focussing on the Archival Storage function now. This should work with a lot of (commercial) storage management and IT/datacenter solutions. For many of these solutions, data integrity and unique persistent link management are new. I believe we could design a more or less general preservation/admin management function (preferably in DSpace). The big issue is: management of what? Content files, items, collections, etc.? I feel this is related to access/ownership and is very community dependent. We are now focussing on first getting the content file management part arranged well, and will worry about managing the higher-level relationships later.

Interestingly, the Dutch Sound & Vision Foundation has developed its own catalogue that links user access to DRM and its content archive. They said that it is basically a simple rule-based matrix, very similar to the rules/policies Reagan Moore is working on with SRB for the storage management part. My question is: shouldn't DSpace be developed in a similar way as well? That is, a very simple/stupid archive content/item system based on storage brokers that are policy driven from a copy/information lifecycle management perspective, plus a very flexible access layer that can be tuned to the community's needs?
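As a rough illustration of the file-based middle layer, the sketch below groups file names by persistent ID and reports which of the three files (content, metadata XML, policy XML) is missing. The naming convention (`<id>.metadata.xml`, `<id>.policy.xml`, and any other extension for the content file) is only an assumption made for this example; it is not an existing DSpace or SRB convention, and the function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Assumed (hypothetical) suffixes for the two XML side files.
METADATA_SUFFIX = ".metadata.xml"
POLICY_SUFFIX = ".policy.xml"
REQUIRED_ROLES = {"content", "metadata", "policy"}


def classify(name):
    """Return (persistent_id, role) for a single file name."""
    if name.endswith(METADATA_SUFFIX):
        return name[:-len(METADATA_SUFFIX)], "metadata"
    if name.endswith(POLICY_SUFFIX):
        return name[:-len(POLICY_SUFFIX)], "policy"
    # Anything else is treated as the content file; the ID is the name
    # without its last extension.
    return Path(name).stem, "content"


def find_incomplete(names):
    """Map each persistent ID to the set of roles that are missing."""
    seen = defaultdict(set)
    for name in names:
        persistent_id, role = classify(name)
        seen[persistent_id].add(role)
    return {
        pid: REQUIRED_ROLES - roles
        for pid, roles in seen.items()
        if REQUIRED_ROLES - roles
    }
```

For example, a listing that contains `def456.wav` and `def456.metadata.xml` but no `def456.policy.xml` would be reported as missing its policy file. Because the check only looks at file names, it could in principle run against any storage back end that can produce a directory listing.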

I look forward to everybody's comments on these ideas!

Peter Walgemoed

  ----- Original Message ----- 
  From: Tellier, Stephane 
  To: James Rutherford 
  Cc: dspace-general at mit.edu 
  Sent: Thursday, March 01, 2007 3:01 PM
  Subject: Re: [Dspace-general] Large-scale DSpace repositories (was Re:Another committer)

  Thanks Jim for your answers.

  > I haven't been involved with the hardware specification for the data
  > centres that will be operating, but I could probably get some
  > information 

  Any kind of information would be great.

  For our case, we will surely use Oracle for our database because our client (Library and National Archives of Quebec) has an important contract with them.

  We already possess a SAN server with a lot of disk space, and we intend to build an architecture with at least 4 or 5 servers (see the attached picture). The objective is to isolate the web applications from the import/export and indexing jobs. We also propose to install the Handle server on a separate machine. Note that this solution does not include any kind of clustering or load balancing, since we think that is something to do after reaching a certain number of items. At the beginning we will probably not have more than 50000 items, essentially images, but some of them are very big, like geographical charts (one GIF could be in the range of 500 MB).

  Our digital collections are essentially collections of JPEG or GIF images, fully indexable PDFs (5000 to 10000), some collections of audio files (not many), some collections of video files (not many for now) and periodicals consisting of many PDFs which are not indexable, since they are "images" of text and none of them has been processed with an OCR tool yet (that will come eventually). Periodicals are the collections that could raise our total to a million items. Note that for periodicals, we intend to find a solution that avoids repeating the metadata about a periodical in every item that contains its PDF files, obviously for performance reasons (see email in dspace-general about Preserving structured collections:
  http://mailman.mit.edu/pipermail/dspace-general/2007-February/001369.html )

  Our team will certainly be very interested in DSpace clustering.

------------------------------------------------------------------------------
  From: James Rutherford [mailto:james.rutherford at hp.com]
  Sent: Thu 01/03/2007 6:49 AM
  To: Tellier, Stephane
  Cc: Dspace-general at mit.edu
  Subject: Large-scale DSpace repositories (was Re: [Dspace-general] Another committer)

  Hi Stephane,

  On Thu, Feb 22, 2007 at 08:19:52AM -0500, Tellier, Stephane wrote:
  > Since you seem to have worked on the China Digital Museum project, I
  > was wondering if it could be possible for you to give some
  > informations about the hardware specs and the hardware architecture
  > (SAN server, load balancing, multiple dspace instances, etc.) about
  > that project. If you could send some documentations about it, or refer
  > to a web site or wiki explaining these aspects, that would be very
  > great.

  I haven't been involved with the hardware specification for the data
  centres that will be operating, but I could probably get some
  information (the estimate is that they will eventually hold ~200TiB of
  content each). As for multiple instances, load balancing, etc, myself
  and Graham Triggs are looking into clustering mechanisms for DSpace,
  both for the database and for the servlet container. If you would like
  to contribute to this effort, or read up on what we have found so far, I
  suggest you review this page:

  http://wiki.dspace.org/HOWTO_Clustering

  This page is very much a work in progress; none of the proposed
  mechanisms of clustering on that page have been successful yet (though
  we are still working on it). For your project, it may be worth
  purchasing clustering services from someone like Oracle (I've not listed
  that as an option because I wanted to provide information on what can be
  done for free).

  > Actually in our team, we're trying to implement a DSpace solution for
  > a library and we could expect to have needs for a very large number of
  > digital documents (over a million could be a possibility), and we are
  > asking ourselves what kind of servers and architecture should we used
  > for that range.

  This is not an easy question to answer, which is presumably why someone
  is paying you to answer it ;) Without knowing more detail about the
  typical document type, size, etc, it would be difficult to give any
  advice on this. That said, no-one is running a DSpace repository with
  more than ~200,000 items, so predicting performance and coming up with
  an architecture for repositories with >1,000,000 documents is naturally
  rather difficult.

  cheers,

  Jim

------------------------------------------------------------------------------

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Director Service.jpg
Type: image/jpeg
Size: 64576 bytes
Desc: not available
Url : http://mailman.mit.edu/pipermail/dspace-general/attachments/20070307/d9a993d0/attachment.jpg