2.1 Liaise with all interested parties in order
to gather detailed information about what they need and to anticipate what they
might need in the future.
At the project's inception it was unclear who
would be responsible for certain areas; it was therefore initially difficult to
plan ahead for the necessary storage facilities. This situation resolved itself
as areas of responsibility became clearly delineated.
2.2 Identify problems and obstacles to
be overcome whilst meeting these needs.
These were mentioned earlier. Firstly,
there was no network to link the two separate locations involved in the
project, which meant no easy backup of or access to the data. Secondly, the data
storage requirements were so large that it would be a challenge to meet them.
Finally, I would have very little or no IT support.
2.3 Evaluate what software should be used - taking into account
features, performance, price, support
and scalability.
In order to save money I decided
to implement well-tested solutions: 64-bit CentOS Linux, MySQL and Apache, with
support for C, Java, PHP, Perl, Python and other free software. All the
software installed on the servers is free and open source (released under the
GNU GPL or similar licences). Some examples of
my reasoning are elaborated below.
CentOS is a free, community-built
version of Red Hat Enterprise Linux. If one pays for Red Hat, one is paying for
support; as I am extremely familiar with the software, I decided that I would not
need any support.
MySQL – a very good relational
database management system. I had considered PostgreSQL, as it has more
functionality, but decided to go with MySQL, as it has superior support. I did not
believe that I needed more sophisticated
solutions than MySQL could provide, and reasoned that I could always switch to
PostgreSQL or to MariaDB, a community-developed fork of MySQL (MySQL itself is
owned by Oracle), if necessary.
Apache – I am very familiar
with this software; furthermore, it is the most popular web server on the
Internet.
I think that the free software I chose has a combination of great
performance, flexibility, reliability and scalability. Any shortfalls in
functionality could be compensated for by writing my own code.
Finally,
I had to consider the issues of security and privacy. I could not allow closed (proprietary)
software to be in control of the sensitive data that I am responsible for. I
needed to be able to inspect the source code and check it for possible back
doors, so that the system would remain secure should it
eventually be connected to the internet. Open source software also provides a
safety net, in that threats are more likely to be identified and reported by the
online community.
2.4 Evaluate what hardware should be used – taking into account the
project requirements, pricing,
support and scalability.
This was difficult since the prices of servers are very high. Older
solutions do not support newer technologies well (for example USB 3.0
or SATA 3). I needed super-fast data transfers since I had no network: most of
the data would have to be transferred on encrypted hard drives. A regular infrastructure
based on proprietary software would initially cost some £60,000 and then approximately
£20,000 a year in licensing fees, which was clearly beyond my budget. Branded
solutions offering USB 2.0 and SATA 2 were not sufficient for my purpose. I needed
large, fast and cheap SATA 3 hard drives, fast controllers (6 Gb/s) and
super-fast USB (5 Gb/s). There was no need to pay extra for SAS or SSD storage, as
quick I/O was not required; the available WD RE SATA 3 drives are sufficient and
offer large capacities for a very good price. The operating systems on both the
nicola1 and nicola2 servers are installed on 240 GB Kingston SSD drives, with working
directories kept on an additional WD RE hard drive. For storage, each server uses a
Direct-Attached Storage (DAS) system, an Areca ARC-8040 (8-bay 6 Gb/s SAS-to-SAS RAID
subsystem) with eight 4 TB WD RE drives on board. This infrastructure
also allows me, as a temporary measure, to use SSD drives installed in USB 3.0
enclosures supporting encryption to move the data between locations.
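To illustrate why USB 2.0 was not sufficient for moving data on physical drives, here is a minimal sketch that estimates transfer time from nominal interface rates. The rates in the table are the published signalling rates; the function name and the `efficiency` parameter are hypothetical and only serve the illustration. Real-world throughput is lower than the nominal rate because of encoding overhead and drive speed, so these figures are best-case estimates.

```python
# Nominal interface signalling rates in bits per second.
NOMINAL_BITS_PER_SECOND = {
    "USB 2.0": 480e6,
    "USB 3.0": 5e9,
    "SATA 2": 3e9,
    "SATA 3": 6e9,
}


def transfer_hours(size_tb, interface, efficiency=1.0):
    """Best-case hours to move size_tb terabytes (1 TB = 10**12 bytes).

    efficiency is a hypothetical fraction (0 < efficiency <= 1) of the
    nominal rate that is actually achieved; 1.0 means the nominal rate.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (NOMINAL_BITS_PER_SECOND[interface] * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for name in ("USB 2.0", "USB 3.0"):
        print(f"{name}: {transfer_hours(1, name):.2f} h per TB at nominal rate")
```

At nominal rates this gives roughly 4.6 hours per terabyte over USB 2.0 against under half an hour over USB 3.0, which is the gap that ruled out the older branded solutions.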
I ended up with four servers. Two of them have 32 TB of raw storage each in RAID
10, giving an effective space of 16 TB per server; the other two use software RAID 1
for reliability and support the CAPI demographics database and the blood samples
database. All data is copied on a weekly basis to the nicola1 server.
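The capacity figures above follow from how mirroring works: RAID 10 stripes across mirrored pairs, so half of the raw space is usable, while a RAID 1 mirror offers the capacity of a single drive. A minimal sketch of that arithmetic (the function name is hypothetical, and filesystem overhead is ignored):

```python
def usable_capacity_tb(drive_count, drive_tb, level):
    """Usable capacity in TB for equally sized drives, ignoring overhead."""
    if level == "RAID 10":
        # Striping across mirrored pairs: needs an even number of drives,
        # at least four, and half of the raw capacity is usable.
        if drive_count < 4 or drive_count % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return drive_count * drive_tb / 2
    if level == "RAID 1":
        # Every drive holds the same data, so one drive's worth is usable.
        if drive_count < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Eight 4 TB drives in RAID 10, as in each DAS unit described above.
    print(usable_capacity_tb(8, 4, "RAID 10"))
```

For eight 4 TB drives this yields 16 TB usable out of 32 TB raw, matching the figures for the two storage servers.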
The
final requirement was for the quiet operation of these drives as one of the
servers is located in a clinic. Water cooling, large fans and automatic fan
optimisation were needed to solve this problem.
2.5 Networking
Although
there is no external network in place, the system is designed in such a way
that it would be easy to switch to a networked version if a network were to be installed.
There is, however, a very limited LAN in place connecting local laptops to the nicola2
server. It supports DHCP, Apache, SSL/TLS, SMB and SSH, so the workstations
can share and upload files over the LAN. There is also a PHP/MySQL-based web form
GUI (Graphical User Interface) for entering data into the database from
locally connected laptops over HTTPS; this is a precautionary security measure
in case a network is installed. I decided that, as an interim
measure, the data would be transferred weekly from nicola2 (the non-networked
server located at the NIHSC) to nicola1 (the networked server, available to researchers)
on a pair of encrypted external SSD drives in USB 3.0 enclosures. The pair
of drives can be swapped to make the transfer process easier and faster:
when one drive is in the NIHSC, the other is in UD (the NICOLA
office) and vice versa.
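When data travels on physical drives rather than over a network, it is useful to confirm that the copy arriving at the destination matches the source. The sketch below is one possible way to do this and is an assumption, not a description of the procedure actually used in the project: it builds a SHA-256 manifest of a directory tree and lists files that are missing or altered in the copy. All function names and paths are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differences(source_manifest, copy_manifest):
    """Return relative paths that are missing from or altered in the copy."""
    return sorted(
        name
        for name, digest in source_manifest.items()
        if copy_manifest.get(name) != digest
    )


if __name__ == "__main__":
    # Hypothetical mount points for the source data and the transfer drive.
    problems = differences(
        build_manifest("/data/export"),
        build_manifest("/mnt/transfer/export"),
    )
    print("all files verified" if not problems else problems)
```

In a weekly transfer of this kind, the manifest could be built on the source server before the drive leaves and compared again after the data has been copied onto the destination server.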