2. Planning



2.1 Liaise with all interested parties in order to gather detailed information about what they need and to anticipate what they might need in the future.
At the project's inception it was unclear who would be responsible for certain areas; it was therefore initially difficult to plan ahead for the necessary storage facilities. This situation resolved itself as areas of responsibility became clearly delineated.

2.2 Identify problems and obstacles to be overcome whilst meeting these needs.
These were mentioned earlier. Firstly, there was no network to link the two separate locations involved in the project, which meant no easy backup of, or access to, the data. Secondly, the data storage requirements were so large that it would be a challenge to meet them. Finally, I would have very little or no IT support.
2.3 Evaluate what software should be used - taking into account features, performance, price, support and scalability.
In order to save money I decided to implement well-tested solutions: 64-bit CentOS Linux, MySQL and Apache, with support for C, Java, PHP, Perl, Python and other free software. All the software installed on the servers is free and distributed under free software licences such as the GNU General Public License. Some examples of my reasoning are elaborated below.
CentOS is a free, community-built version of Red Hat Enterprise Linux. Paying for Red Hat means paying for support; as I am extremely familiar with the software, I decided that I would not need any support.
MySQL: a very good relational database management system. I had considered PostgreSQL, as it has more functionality, but decided to go with MySQL as it has superior support. I did not believe that I needed anything more sophisticated than MySQL could provide, and reasoned that I could always switch to PostgreSQL or to MariaDB, a community-developed fork of MySQL (MySQL itself is owned by Oracle), if necessary.
Apache: I am very familiar with this software; furthermore, it is the most popular web server on the Internet.

I think that the free software I chose offers a combination of great performance, flexibility, reliability and scalability. Any shortfalls in functionality could be compensated for by writing my own code.

Finally, I had to consider the issues of security and privacy. I could not allow closed (proprietary) software to be in control of the sensitive data for which I am responsible: I needed to be able to inspect the source code and evaluate any possible back door, so that the system would be secure should it eventually be connected to the Internet. Open-source software also provides a safety net, in that threats tend to be identified quickly by the online community.
2.4 Evaluate what hardware should be used – taking into account the project requirements, pricing, support and scalability.
This was difficult, since server prices are very high and older solutions do not support newer technologies well (for example USB 3.0 or SATA 3). I needed super-fast data transfers because I had no network: most of the data would have to be transferred on encrypted hard drives. A regular infrastructure based on proprietary software would initially cost some £60,000 and then approximately £20,000 a year in licensing fees, which was clearly beyond my budget. Branded solutions offering USB 2.0 and SATA 2 were not sufficient for my purpose. I needed large, fast and cheap SATA 3 hard drives, fast controllers (6 Gb/s) and super-fast USB (5 Gb/s). There was no need to pay extra for SAS or SSD storage drives, since quick I/O was not required; the available WD RE SATA 3 drives are sufficient, offering large capacities for a very good price. In both the nicola1 and nicola2 servers the operating system is installed on a 240 GB Kingston SSD, working directories are kept on an additional WD RE hard drive, and storage is provided by an Areca ARC-8040 Directly Attached Storage (DAS) unit (an 8-bay 6 Gb/s SAS-to-SAS RAID subsystem) holding eight 4 TB WD RE drives. This infrastructure allows me to temporarily use SSD drives installed in USB 3.0 enclosures supporting encryption to move the data between locations.
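To illustrate why USB 2.0 and SATA 2 were ruled out, the following minimal Python sketch compares the best-case transfer time over each interface. The rates are the nominal signalling rates of each standard, not measurements from this project, and the 1 TB payload is a hypothetical figure chosen only for illustration; real throughput is always lower because of protocol and encryption overhead.

```python
# Nominal signalling rates in bits per second (standard figures, not measurements).
NOMINAL_RATE_BPS = {
    "USB 2.0": 480e6,
    "USB 3.0": 5e9,
    "SATA 2": 3e9,
    "SATA 3": 6e9,
}


def min_transfer_hours(payload_bytes, interface):
    """Lower bound on transfer time: payload size divided by the nominal line rate."""
    seconds = payload_bytes * 8 / NOMINAL_RATE_BPS[interface]
    return seconds / 3600


# Hypothetical payload of 1 TB (10**12 bytes); an assumption, not a project figure.
PAYLOAD_BYTES = 1e12

for name in NOMINAL_RATE_BPS:
    hours = min_transfer_hours(PAYLOAD_BYTES, name)
    print(f"{name}: at least {hours:.2f} h")
```

Under these assumptions the same payload needs at least about 4.6 hours over USB 2.0 but well under half an hour over USB 3.0, which is the difference that matters when drives are carried between sites.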
I ended up with four servers. Two of them each have 32 TB of raw storage in RAID 10, giving an effective space of 16 TB each; the other two use software RAID 1 for reliability and support the CAPI demographics database and the blood samples database. All data is copied on a weekly basis to the nicola1 server.
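The capacity figures for the two storage servers follow from how RAID 10 works: every drive is mirrored, so only half of the raw space is usable. A short Python sketch of that arithmetic, using the eight 4 TB drives per array described above (the function names are my own, introduced only for illustration):

```python
def raw_capacity_tb(drive_count, drive_tb):
    """Total capacity of all drives before any redundancy is applied."""
    return drive_count * drive_tb


def raid10_usable_tb(drive_count, drive_tb):
    """Usable capacity of a RAID 10 array: striped mirrors keep half the raw space."""
    if drive_count < 4 or drive_count % 2 != 0:
        raise ValueError("RAID 10 needs an even number of drives, at least four")
    return raw_capacity_tb(drive_count, drive_tb) / 2


# Eight 4 TB drives per array, as in the storage servers described in the text.
print(raw_capacity_tb(8, 4))    # raw space in TB
print(raid10_usable_tb(8, 4))   # effective space in TB
```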
The final requirement was quiet operation of this hardware, as one of the servers is located in a clinic. Water cooling, large fans and automatic fan optimisation were needed to solve this problem.
2.5 Networking
Although there is no external network in place, the system is designed so that it would be easy to switch to a networked version if a network were installed. There is, however, a very limited LAN connecting local laptops to the nicola2 server. It supports DHCP, Apache, SSL/TLS, SMB and SSH, so the workstations can share and upload files over the LAN. There is also a PHP/MySQL-based web form GUI (Graphical User Interface) for entering data into the database from locally connected laptops over HTTPS; using HTTPS is a precautionary security measure in case a network is installed. As an interim measure, I decided that the data should be transferred weekly from nicola2 (the non-networked server located at the NIHSC) to nicola1 (the networked server, available to researchers) on a pair of external encrypted SSD drives, each in a USB 3.0 enclosure. The two drives can be swapped to make the transfer process easier and faster: when one drive is at the NIHSC, the other is at UD (the NICOLA office), and vice versa.
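The drive swap described above can be written down as a simple alternating schedule. The sketch below is only an illustration: the labels "drive A" and "drive B", the week numbering from 0, and the choice of which drive starts at which site are all assumptions of mine, not details of the actual procedure.

```python
SITES = ("NIHSC", "UD")  # the two locations named in the text


def drive_locations(week):
    """Return where each drive of the pair sits in a given week (0-based).

    The drives trade places every week, so they are never at the same site.
    """
    return {
        "drive A": SITES[week % 2],        # hypothetical label
        "drive B": SITES[(week + 1) % 2],  # hypothetical label
    }


for week in range(4):
    print(week, drive_locations(week))
```

Because the two drives are always at opposite sites, a freshly written drive can leave the NIHSC while the other one is already on its way back.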
