Code and Life

Programming, electronics and other cool tech stuff


Python on-the-fly AES encryption/decryption and transfer to AWS S3

So, I started writing a file database and toolset called fileson to take advantage of AWS S3 Glacier Deep Archive (let's just call it GDA from now on). With 1 €/mo/TB storage cost, it is essentially a dirt cheap option to store very infrequently accessed data like offsite backups.

Why not just use rclone? Well, I disliked the fact that all the tools I tried make a ton of (paid) queries against S3 when syncing. I thought a simple JSON file database should work to keep track of what to copy and delete. That work is progressing, but as a part of it...

Encrypting on the fly with Python and Pycrypto(dome)

I started thinking that client side encryption would be useful as well. AES is tried and tested, and it's easy to find sample code to do it. But it seems wasteful to first create encrypted files on your hard drive, then upload them to AWS and finally delete everything.

Luckily, the Python AWS SDK boto3 has a great example on how to upload a file to S3 with upload_fileobj that accepts "a readable file-like object". What does that mean? Let's find out!

(note that you need to have the boto3 and pycryptodome libraries installed to successfully run these examples)

#!/usr/bin/env python3
import hashlib, os, boto3

class FileLike:
    def __init__(self, filename, mode):
        self.fp = open(filename, mode)
    def write(self, data):
        print('write', len(data), 'bytes')
        return self.fp.write(data)
    def read(self, size=-1):
        print('read', size, 'bytes')
        return self.fp.read(size)
    def tell(self):
        print('tell =', self.fp.tell())
        return self.fp.tell()
    def seek(self, offset, whence=0):
        print('seek', offset, whence)
        return self.fp.seek(offset, whence) # file objects return the new position
    def close(self):
        print('close')
        self.fp.close()

s3 = boto3.client('s3')

fp = FileLike('hash.py', 'rb')
print('Uploading...')
s3.upload_fileobj(fp, 'mybucket', 'please_remove.txt')
print('Done...')

The FileLike class is a dummy wrapper around basic file functions that prints out what is happening when s3.upload_fileobj uses the provided object.

user@server:~$ ./s3_test.py
Uploading...
seek 0 1
tell = 0
seek 0 2
tell = 357
seek 0 0
tell = 0
tell = 0
read 357 bytes
read 0 bytes
seek 0 0
read 357 bytes
read 0 bytes
close
Done...

So what happens? upload_fileobj seems to:

  1. Seek to the current position with fp.seek(offset=0, whence=1).
  2. Call tell, most likely to record where in the file it started.
  3. Seek to the end of the file with fp.seek(offset=0, whence=2).
  4. Call tell again to learn where the file ends.
  5. Seek back to where it started.
  6. Verify the position with tell.
  7. Read the data (very large files would probably be read in chunks).
  8. Go back to the beginning and read the data again -- the first pass is most likely for checksumming.
  9. Close the file (surprising, as it did not open it...).
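Steps 1-6 are the standard seek/tell dance for measuring a stream's size without consuming it. A minimal sketch of that same pattern:

```python
import io

def stream_size(fp):
    """Measure a seekable stream's size without consuming it,
    mirroring the seek/tell pattern upload_fileobj performs."""
    start = fp.tell()   # remember where we started (steps 1-2)
    fp.seek(0, 2)       # jump to the end of the stream (step 3)
    size = fp.tell()    # offset at the end == total size (step 4)
    fp.seek(start, 0)   # restore the original position (steps 5-6)
    return size

buf = io.BytesIO(b'hello world')
print(stream_size(buf))  # 11
print(buf.read())        # b'hello world' -- position was restored
```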

This tells us exactly what is the minimum needed to implement an "on the fly" AES encoding file object:

  • read function that takes number of bytes
  • tell function that can return 0 at the beginning and filesize in the end, possibly intermediate values after some reads
  • seek function that can go to beginning (0,0), nowhere (0,1) and end (0,2)

Now I chose to add a slightly complicating twist. When encrypting on the fly with AES CTR (which I chose to avoid padding), one needs to store the randomized initial value of the counter (usually shortened to iv) somewhere. With a 128-bit counter, this is 16 bytes, and it is usually stored at the beginning of the encrypted file.

For my wrapper this means the first 16 bytes "read" should return the iv, and only after that the encrypted data. Likewise, tell at the end of the file should return a length 16 bytes longer than the file being encrypted. Doable, but since we cannot be 100 % sure the first read will not be something short like 10 bytes (leaving 6 more bytes of iv to return), the read function needs some conditionals.
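To see why CTR mode needs no padding and why the iv must travel with the ciphertext, here is a toy sketch of the counter-mode idea. SHA-256 stands in for the AES block function purely for illustration -- this is NOT real AES and not secure, it only shows the structure:

```python
import hashlib

def toy_ctr(key, iv, data):
    """Toy CTR-style stream cipher: hash(key + counter) produces a
    keystream that is XORed with the data. Encryption and decryption
    are the same operation, and any data length works -- no padding."""
    out = bytearray()
    for block in range(0, len(data), 32):
        counter = int.from_bytes(iv, 'big') + block // 32
        keystream = hashlib.sha256(key + counter.to_bytes(32, 'big')).digest()
        chunk = data[block:block+32]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)

key, iv = b'secret', bytes(16)  # fixed all-zero iv only for the demo
msg = b'arbitrary length, no padding!'
blob = iv + toy_ctr(key, iv, msg)           # store iv with the ciphertext
plain = toy_ctr(key, blob[:16], blob[16:])  # first 16 bytes are the iv
assert plain == msg
```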

Also, when you are seeking back to start, you need to reset the AES encryption, as boto3 does two passes on the upload (presumably for checksumming). Here's the final wrapper (with "write" support as well to support on-the-fly decryption when downloading from S3 and writing to disk):

from Crypto.Cipher import AES
from Crypto.Util import Counter
import hashlib, os

class AESFile:
    """On-the-fly AES encryption (on read) and decryption (on write).
    When reading, returns 16 bytes of iv first, then encrypted payload.
    On writing, first 16 bytes are assumed to contain the iv.
    Does the bare minimum, you may get errors if not careful."""
    @staticmethod
    def key(passStr, saltStr, iterations=100000):
        return hashlib.pbkdf2_hmac('sha256', passStr.encode('utf8'),
            saltStr.encode('utf8'), iterations)

    def initAES(self):
        self.obj = AES.new(self.key, AES.MODE_CTR, counter=Counter.new(
            128, initial_value=int.from_bytes(self.iv, byteorder='big')))

    def __init__(self, filename, mode, key, iv=None):
        if mode not in ('wb', 'rb'):
            raise RuntimeError('Only rb and wb modes supported!')

        self.pos = 0
        self.key = key
        self.mode = mode
        self.fp = open(filename, mode)

        if mode == 'rb':
            self.iv = iv or os.urandom(16)
            self.initAES()
        else: self.iv = bytearray(16)

    def write(self, data):
        datalen = len(data)
        if self.pos < 16:
            ivlen = min(16-self.pos, datalen)
            self.iv[self.pos:self.pos+ivlen] = data[:ivlen]
            self.pos += ivlen
            if self.pos == 16: self.initAES() # ready to init now
            data = data[ivlen:]
        if data: self.pos += self.fp.write(self.obj.decrypt(data))
        return datalen

    def read(self, size=-1):
        ivpart = b''
        if self.pos < 16: # serve (remaining) iv bytes first
            if size == -1: ivpart = self.iv[self.pos:]
            else:
                ivpart = self.iv[self.pos:min(16, self.pos+size)]
                size -= len(ivpart)
        enpart = self.obj.encrypt(self.fp.read(size)) if size else b''
        self.pos += len(ivpart) + len(enpart)
        return ivpart + enpart

    def tell(self): return self.pos

    # only in read mode (encrypting)
    def seek(self, offset, whence=0): # enough seek to satisfy AWS boto3
        if offset: raise RuntimeError('Only seek(0, whence) supported')

        self.fp.seek(offset, whence) # offset=0 works for all whences
        if whence==0: # absolute positioning, offset=0
            self.pos = 0
            self.initAES()
        elif whence==2: # relative to file end, offset=0
            self.pos = 16 + self.fp.tell()

    def close(self): self.fp.close()

Using the wrapper locally is trivial: just replace the normal fp = open(filename, 'rb') with fp = AESFile(filename, 'rb', key) (you can generate a 16-byte key yourself, or use AESFile.key to get a proper PBKDF2-derived key from a password and salt). Reading from that file pointer will give you first the iv and then the contents of filename, encrypted.
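For reference, AESFile.key is just a thin wrapper over Python's standard hashlib.pbkdf2_hmac, so you can reproduce the derived key without the class:

```python
import hashlib

# Same derivation AESFile.key performs: PBKDF2-HMAC-SHA256 turns a
# password and salt into a 32-byte key (SHA-256 digest length),
# which selects AES-256 when passed to AES.new.
key = hashlib.pbkdf2_hmac('sha256', 'password'.encode('utf8'),
                          'salt'.encode('utf8'), 100000)
print(len(key))  # 32
```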

To decrypt, you replace rb with wb and write the encrypted data; the wrapper writes the decrypted data into the chosen file. I've provided a complete encryption/decryption utility with this source in fileson's crypt.py.

Wrapping it up into AWS S3

Armed with the above class, it becomes trivial to adapt the boto3 AWS S3 examples to encrypt on the fly during upload and decrypt on the fly during download. Note that you need to configure boto3 properly before running the code below, so follow the SDK docs first, and only proceed after you've successfully run their example without encryption.
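In practice that means having AWS credentials in place, e.g. via the aws configure command or a credentials file in your home directory (placeholder values shown, substitute your own):

```ini
; ~/.aws/credentials -- placeholder values for illustration
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

The default region can be set similarly in ~/.aws/config.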

#!/usr/bin/env python3
import boto3

from crypt import AESFile

import argparse, time

parser = argparse.ArgumentParser(description='AWS S3 upload/download with on-the-fly encryption')
parser.add_argument('mode', type=str, choices=['upload','download'], help='Mode')
parser.add_argument('bucket', type=str, help='S3 bucket')
parser.add_argument('input', type=str, help='Input file or S3 object name')
parser.add_argument('output', type=str, help='Output file or S3 object name')
parser.add_argument('password', type=str, help='Password')
parser.add_argument('salt', type=str, help='Salt')
parser.add_argument('-i', '--iterations', type=int, default=100000,
        help='PBKDF2 iterations (default 100000)')
args = parser.parse_args()

s3 = boto3.client('s3')

key = AESFile.key(args.password, args.salt, args.iterations)

if args.mode == 'upload':
    fp = AESFile(args.input, 'rb', key)
    s3.upload_fileobj(fp, args.bucket, args.output)
else:
    fp = AESFile(args.output, 'wb', key)
    s3.download_fileobj(args.bucket, args.input, fp)

fp.close()

Super cool. You need to have a file and a bucket, but armed with those, let's try it out (writing the script itself to a folder called test in S3):

user@server:~$ ./aws.py upload mybucket aws.py test/aws.bin password salt
user@server:~$ ./aws.py download mybucket test/aws.bin aws2.py password salt
user@server:~$ diff aws.py aws2.py

If all went perfectly, diff should find your files identical. You can download the test/aws.bin yourself to view the encrypted version.

Awesome. You can now store encrypted stuff to AWS at will.