[Dspace-general] Large-scale DSpace repositories (was Re: Another committer)

Thu Mar 1 09:01:09 EST 2007

Thanks Jim for your answers.

> I haven't been involved with the hardware specification for the data
> centres that will be operating, but I could probably get some
> information 

Any kind of information would be great.

In our case, we will certainly use Oracle for our database, because our client (Library and National Archives of Quebec) has a major contract with them.

We already have a SAN server with plenty of disk space, and we intend to build an architecture with at least 4 or 5 servers (see the attached diagram). The objective is to isolate the web applications from the import/export and indexing jobs. We also propose installing the Handle server on a separate machine. Note that this design does not include any clustering or load balancing; we think that is something to add once we reach a certain number of items. At the beginning we will probably have no more than 50,000 items, mostly images, but some of them are very large, such as geographical charts (a single GIF could be in the range of 500 MB).
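
As a rough back-of-the-envelope sketch of the storage question: only the 50,000-item count and the 500 MB chart size come from the figures above. The typical item size and the share of very large items are placeholder assumptions, not measurements from our collections, and the function name is purely illustrative.

```python
MB = 10**6  # decimal megabyte, in bytes


def estimate_storage_bytes(item_count, large_fraction, large_size, typical_size):
    """Estimate raw bitstream storage for a collection, in bytes.

    large_fraction is the (assumed) share of items that are very large,
    such as scanned geographical charts; all other items are counted at
    typical_size. Metadata, indexes and backups are not included.
    """
    large_items = round(item_count * large_fraction)
    typical_items = item_count - large_items
    return large_items * large_size + typical_items * typical_size


# Upper bound: every one of the initial 50,000 items is a 500 MB chart.
worst_case = estimate_storage_bytes(50_000, 1.0, 500 * MB, 0)

# Hypothetical mix: 1% large charts, the rest at an assumed 2 MB each.
mixed_case = estimate_storage_bytes(50_000, 0.01, 500 * MB, 2 * MB)

print(worst_case / 10**12, "TB worst case")
print(mixed_case / 10**12, "TB with the assumed mix")
```

The worst case gives an upper bound for the initial load; the mixed case only shows how strongly the result depends on the share of large charts, which we have not measured yet.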

Our digital collections consist mainly of JPEG or GIF images, fully indexable PDFs (5,000 to 10,000), a few collections of audio files, a few collections of video files (not many for now), and periodicals made up of many PDFs that are not indexable, since they are "images" of text and none of them has been processed with an OCR tool yet (that will come eventually). The periodicals are the collections that could push our total to a million items. Note that for the periodicals, we intend to look for a solution that avoids repeating a periodical's metadata across all the items that contain its PDF files, for performance reasons (see the email in dspace-general about preserving structured collections:
http://mailman.mit.edu/pipermail/dspace-general/2007-February/001369.html )
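
To illustrate the idea of not repeating a periodical's metadata on every item, here is a minimal sketch in plain Python. It is not DSpace code or the DSpace data model: the class names, field names and sample values are all hypothetical, and it only shows each issue holding a reference to one shared metadata record instead of its own copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class PeriodicalMetadata:
    """Metadata stored once per periodical (hypothetical fields)."""
    title: str
    publisher: str


@dataclass
class IssueItem:
    """One issue: its own date and PDF files, plus a shared reference."""
    periodical: PeriodicalMetadata  # a reference, not a copy
    issue_date: str
    pdf_files: List[str] = field(default_factory=list)

    def effective_metadata(self) -> Dict[str, str]:
        # Combine the shared periodical fields with the issue-level fields
        # only when they are needed (for display or indexing, for example).
        return {
            "title": self.periodical.title,
            "publisher": self.periodical.publisher,
            "date": self.issue_date,
        }


# Placeholder values for illustration only.
gazette = PeriodicalMetadata(title="Example Gazette", publisher="Example Press")
issue_1 = IssueItem(gazette, "1900-01-01", ["page-001.pdf", "page-002.pdf"])
issue_2 = IssueItem(gazette, "1900-01-08", ["page-001.pdf"])

print(issue_1.effective_metadata())
print(issue_1.periodical is issue_2.periodical)  # both point at one record
```

With this layout, a correction to the periodical's title or publisher is made in one place, and each issue only carries the fields that actually differ from one issue to the next.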

Our team is certainly very interested in DSpace clustering.

________________________________

From: James Rutherford [mailto:james.rutherford at hp.com]
Sent: Thu 01/03/2007 6:49 AM
To: Tellier, Stephane
Cc: Dspace-general at mit.edu
Subject: Large-scale DSpace repositories (was Re: [Dspace-general] Another committer)

Hi Stephane,

On Thu, Feb 22, 2007 at 08:19:52AM -0500, Tellier, Stephane wrote:
> Since you seem to have worked on the China Digital Museum project, I
> was wondering if it could be possible for you to give some
> informations about the hardware specs and the hardware architecture
> (SAN server, load balancing, multiple dspace instances, etc.) about
> that project. If you could send some documentations about it, or refer
> to a web site or wiki explaining these aspects, that would be very
> great.

I haven't been involved with the hardware specification for the data
centres that will be operating, but I could probably get some
information (the estimate is that they will eventually hold ~200TiB of
content each). As for multiple instances, load balancing, etc, myself
and Graham Triggs are looking into clustering mechanisms for DSpace,
both for the database and for the servlet container. If you would like
to contribute to this effort, or read up on what we have found so far, I
suggest you review this page:

http://wiki.dspace.org/HOWTO_Clustering

This page is very much a work in progress; none of the proposed
mechanisms of clustering on that page have been successful yet (though
we are still working on it). For your project, it may be worth
purchasing clustering services from someone like Oracle (I've not listed
that as an option because I wanted to provide information on what can be
done for free).

> Actually in our team, we're trying to implement a DSpace solution for
> a library and we could expect to have needs for a very large number of
> digital documents (over a million could be a possibility), and we are
> asking ourselves what kind of servers and architecture should we used
> for that range.

This is not an easy question to answer, which is presumably why someone
is paying you to answer it ;) Without knowing more detail about the
typical document type, size, etc, it would be difficult to give any
advice on this. That said, no-one is running a DSpace repository with
more than ~200,000 items, so predicting performance and coming up with
an architecture for repositories with >1,000,000 documents is naturally
rather difficult.

cheers,

Jim

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DSpace Network Diagram (prototype).jpg
Type: image/pjpeg
Size: 270972 bytes
Desc: DSpace Network Diagram (prototype).jpg
Url : http://mailman.mit.edu/pipermail/dspace-general/attachments/20070301/bc78fc0c/attachment.bin