The other day I was asked to find a way to send sensitive clinical data to another institute. How can one make sure that the data is protected and only accessible to the right people? There are two aspects to protecting data, reflecting the different risks to which the data may be exposed:
- data in transit (email “in flight”, web or FTP downloads, data sets on USB disks shipped by FedEx, etc.)
- data at rest (email that has arrived in the recipient’s inbox, data copied to a collaborator’s working disk, etc.)
Here we will only explore the requirements for encrypting data in transit. The security of the data at rest is assumed to be taken care of by the collaborator or their IT staff, since it is outside one’s control.
There are various possible file transfer methods:
- email – suitable for small files (typically up to 5MB although different sites impose different limits); no automatic encryption in transit
- FTP or non-SSL password-protected web site – suitable for large files (in the GB range); no automatic encryption in transit
- scp – suitable for large files; intrinsic encryption in transit; likely to encounter firewall issues
- password-protected SSL web site – suitable for large files; intrinsic encryption in transit
- USB disk – suitable for very large data sets (TB range); no automatic encryption in transit
When encryption is mandated (e.g. by a data access agreement) and the file transfer method does not provide encryption intrinsically, it is necessary to encrypt the data separately and transfer the encrypted file by the chosen method.
For ad-hoc or one-off data encryption, it is appropriate to encrypt a data set with a password (“symmetric encryption”, because the same password is used to encrypt and decrypt), which is then sent to the recipient by a different channel from the data itself. For example, if the data is shipped on a USB disk, the password could be sent by email, or given over the phone. Sending the password with the encrypted data defeats the object of encrypting it!
For regular or scheduled data transfers, public-key encryption may be suitable – and removes the need to send a password – but that will not be explored here due to the extra work in creating and managing keys.
A suitable encryption tool on Linux systems is gpg (the GNU Privacy Guard). The simplest usage is to prepare a single file containing the data in question using tar or zip, and then to encrypt that:
    $ gpg -c bigfile.tar
    gpg: gpg-agent is not available in this session
    Enter passphrase:
    Repeat passphrase:
    $ ls bigfile.tar*
    bigfile.tar bigfile.tar.gpg
At this point, “bigfile.tar.gpg” is the encrypted file, which is safe to transfer by email, FTP, or any other unencrypted method. Note that the passphrase is not displayed while it is being entered, and that the encrypted file is typically smaller than the original because gpg compresses the data as part of the encryption process. However, it is necessary to have enough disk space to hold both the original and the encrypted data simultaneously, which may make this approach unsuitable for very large (TB) datasets.
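One possible way around the disk space limit is to stream the archive straight into gpg, so that the unencrypted tar file is never written to disk. The following is a minimal sketch rather than a tested recipe for every gpg version: the function name `encrypt_dir` and the directory name `dataset` are hypothetical, and it assumes that gpg reads its input from standard input when no file name is given.

```shell
# Sketch: archive a directory and encrypt it in one pass, so that only
# the encrypted copy is written to disk.
# "encrypt_dir" and "dataset" are hypothetical names used for illustration.
encrypt_dir() {
    src_dir=$1          # directory to archive
    out_file=$2         # encrypted output file, e.g. dataset.tar.gpg
    shift 2             # any remaining arguments are passed on to gpg
    # tar writes the archive to stdout ("-f -"); gpg -c reads it from stdin
    tar -cf - "$src_dir" | gpg "$@" -c -o "$out_file"
}

# Interactive use (gpg prompts for the passphrase as before):
#   encrypt_dir dataset dataset.tar.gpg
```

Only the encrypted output needs disk space in addition to the original data.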
The passphrase should be chosen with the same care as a computer login password. The Linux utility “pwgen” produces a selection of random passwords which may be useful in selecting a suitable passphrase.
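As a rough illustration of generating a passphrase, the sketch below wraps pwgen in a small helper. The function name `make_passphrase` is hypothetical, and the fallback to /dev/urandom is an addition for systems where pwgen is not installed, not something pwgen itself provides.

```shell
# Sketch: print one random passphrase of the requested length.
# "make_passphrase" is a hypothetical helper name.
make_passphrase() {
    length=${1:-20}     # default to 20 characters
    if command -v pwgen >/dev/null 2>&1; then
        # -s: completely random ("secure") passwords; trailing 1: print just one
        pwgen -s "$length" 1
    else
        # Fallback (an assumption, not part of pwgen): keep only letters and
        # digits from 1024 random bytes, then cut to the requested length.
        # This yields enough characters for lengths up to roughly 100.
        head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-"$length"
    fi
}

make_passphrase 24
```

A passphrase generated this way still has to reach the recipient by a channel separate from the data.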
The recipient will decrypt the file in a similar way:
    $ gpg bigfile.tar.gpg
    gpg: CAST5 encrypted data
    gpg: gpg-agent is not available in this session
    Enter passphrase:
    gpg: encrypted with 1 passphrase
    gpg: WARNING: message was not integrity protected
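If the recipient is also short of disk space, decryption and unpacking can be combined in a similar streaming fashion. This is a sketch under the same assumptions as any pipeline of this kind: the function name `decrypt_and_unpack` is hypothetical, and the destination directory must already exist.

```shell
# Sketch: decrypt an archive and unpack it in one pass, so that the
# intermediate unencrypted tar file is never written to disk.
# "decrypt_and_unpack" is a hypothetical helper name.
decrypt_and_unpack() {
    in_file=$1          # encrypted archive, e.g. bigfile.tar.gpg
    dest_dir=$2         # existing directory to unpack into
    shift 2             # any remaining arguments are passed on to gpg
    # gpg -d writes the decrypted archive to stdout; tar reads it from stdin
    gpg "$@" -d "$in_file" | tar -xf - -C "$dest_dir"
}

# Interactive use (gpg prompts for the passphrase):
#   mkdir unpacked
#   decrypt_and_unpack bigfile.tar.gpg unpacked
```

Any gpg messages, including a failed passphrase attempt, still appear on the terminal because they are written to standard error rather than into the pipe.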
Note that if the passphrase is lost then it is vanishingly unlikely that the encrypted data can be recovered. Unless the passphrase is easily guessable, the encryption is strong enough to defeat most attempts to break it.
Written by Dr David Holland (WTSI), adapted by Manuel Corpas. Posted with Dr Holland’s permission.