
Marc Kerins: Inspect PCAP Files Using AWS Lambda


AWS Lambda is a service that allows you to run code without provisioning a server. This has some interesting possibilities, especially when processing data asynchronously. When I first started learning about Lambda, most of the examples were about resizing images. I work with PCAP files on a daily basis and have used scapy for several years, so I thought it would be a good experiment to use Lambda to do some simple PCAP inspection.

It will work something like this:

  1. A PCAP file is uploaded to a specific S3 bucket.
  2. The bucket will trigger an event and notify the Lambda function.
  3. The Lambda function will load the PCAP file from S3 and extract MAC addresses.
  4. The manufacturer of the network interface will be looked up using an external API.

We'll keep track of the technical debt we accumulate and address it after we have a working proof of concept. If you want to put a more positive spin on it, you can call it a list of opportunities for improvement. Let's get started.

It's always a good idea to set up a virtual environment when starting a Python project, no matter how small. Let's create one, making sure to use Python 2.7, and install scapy:

$ mkdir inspect-pcap-lambda && cd inspect-pcap-lambda
$ virtualenv --python=python2.7 env
$ source env/bin/activate
$ pip freeze > pre_scapy.txt
$ pip install scapy
$ pip freeze > post_scapy.txt

You'll see later in the post why I captured the output of pip freeze both before and after installing scapy. scapy has a very useful REPL interface that can be loaded by simply running:

$ scapy

It's a great way to learn about the tool and I recommend exploring it further. We'll be writing a normal Python script, though, so scapy will be imported. To make things easier, we'll structure our script to match what AWS Lambda expects:

from __future__ import print_function

import scapy


def handler(event, context):  
    pass

The handler function is the entry point for the Lambda function. You can name it anything you like, since Lambda imports the module and calls whichever handler you configure. We'll call this file inspect_pcap.py, so when our Lambda function runs, inspect_pcap.handler(event, context) will be called. (While experimenting on our own machine, we'll work in a local copy named inspect_pcap_local.py.)

Let's start using scapy to inspect a PCAP file. There are lots of places where you can download PCAP files, but you can easily create your own. Make sure all other applications are closed so they don't clutter up the capture, and start tcpdump in one terminal:

$ sudo tcpdump -i enp0s3 -s 1514 -w wget_google.pcap

In another terminal make a simple HTTP request to Google:

$ wget http://www.google.com

In the first terminal press Ctrl-C to stop the capture. tcpdump will print the number of packets captured; mine was 29, and yours shouldn't be much more than that. Take a look at your capture by reading it back:

$ tcpdump -r wget_google.pcap
reading from file wget_google.pcap, link-type EN10MB (Ethernet)  
...
13:24:16.047269 IP 10.1.1.34.42683 > 10.1.1.1.domain: 63805+ A? www.google.com. (32)  
13:24:16.066879 IP 10.1.1.1.domain > 10.1.1.34.42683: 63805 1/0/0 A 172.217.7.164 (48)  
...

The packets shown represent a host at 10.1.1.34 performing a DNS lookup for www.google.com and a DNS server at 10.1.1.1 returning the IP address.

Ok, back to scapy. Let's add some code that opens the PCAP and prints a summary of each packet to the console. We need a way to actually invoke the handler function so we'll add that as well:

from __future__ import print_function

from scapy.all import rdpcap


def handler(event, context):  
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')

    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        print(pkt.summary())

if __name__ == '__main__':  
    handler(event=None, context=None)

Next run the script:

$ python inspect_pcap_local.py
...
Ether / IP / UDP / DNS Qry "www.google.com."  
Ether / IP / UDP / DNS Ans "172.217.7.164"  
...

Technical Debt Item 1 - write unit tests and configure CI
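
When we get around to item 1, a first test might look something like this sketch. It assumes we eventually factor the MAC extraction out of handler() into a pure helper (extract_macs here is hypothetical):

import unittest

from scapy.all import Ether, IP


def extract_macs(packets):
    # Hypothetical helper, factored out of handler() so it can be tested in isolation
    macs = set()
    for pkt in packets:
        macs.add(pkt[Ether].src)
        macs.add(pkt[Ether].dst)
    return macs


class ExtractMacsTest(unittest.TestCase):
    def test_extracts_src_and_dst(self):
        # Craft a packet in memory instead of reading a PCAP from disk
        pkt = Ether(src='08:00:27:71:bc:15', dst='ff:ff:ff:ff:ff:ff') / IP()
        self.assertEqual(
            extract_macs([pkt]),
            {'08:00:27:71:bc:15', 'ff:ff:ff:ff:ff:ff'})

if __name__ == '__main__':
    unittest.main()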

Ether, IP and UDP represent the layers of the packets. See Internet protocol suite for more information about what each of these layers does and how they work together. For now we want to take a closer look at the Ether layer, specifically the source and destination MAC addresses.

We'll look up the manufacturer of each MAC address based on its first 24 bits, called the Organizationally Unique Identifier or OUI. Manufacturers of network interfaces are assigned one or more OUIs by the IEEE, and each OUI is universally unique. This means that every network interface has a globally unique MAC address; that's a simplification, but it'll work for our purposes. The MAC address of the (VirtualBox) host at 10.1.1.34 is 08:00:27:71:bc:15. Its OUI is the first 24 bits, or 08:00:27, which according to the Wireshark - OUI Lookup page is assigned to "PCS Systemtechnik GmbH".
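
Since the colon-separated form uses two hex digits per octet, slicing the OUI out of a MAC string is a one-liner in the REPL:

>>> mac = '08:00:27:71:bc:15'
>>> mac[:8]  # three octets = 24 bits, the OUI
'08:00:27'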

To extract the MAC addresses from each packet, we need to look at the Ether layer.

from __future__ import print_function

from scapy.all import rdpcap, Ether


def handler(event, context):  
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')

    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        print('src_mac = {} dst_mac = {}'.format(src_mac, dst_mac))

if __name__ == '__main__':  
    handler(event=None, context=None)

This is not robust code. If a packet doesn't have an Ether layer, getlayer() returns None and the attribute access will raise an unhandled AttributeError.

Technical Debt Item 2 - add proper exception handling
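
When we pay this down, the fix is a guard rather than a try/except; a minimal sketch using scapy's haslayer():

from scapy.all import rdpcap, Ether

pcap = rdpcap('wget_google.pcap')
macs = set()
for pkt in pcap:
    # Skip frames with no Ethernet layer instead of blowing up on None
    if not pkt.haslayer(Ether):
        continue
    macs.add(pkt[Ether].src)
    macs.add(pkt[Ether].dst)

For now, our clean wget capture is all Ethernet frames, so the current script runs fine: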

$ python inspect_pcap_local.py
src_mac = 74:d0:2b:93:fb:12 dst_mac = ff:ff:ff:ff:ff:ff  
...
src_mac = 08:00:27:71:bc:15 dst_mac = 2c:30:33:e9:3c:a3  
...

We can see the source and destination MAC addresses of each packet in the PCAP. Next, we need to put these values into a set so we can query the external API. A set is a better choice than a list because many values will be repeated and we don't want to query the external API any more than necessary.

Sending a few extra queries may not seem like a big deal, but we are using a public cloud where everything has a cost. Also, we want to be good Internet citizens and not unduly burden a free API.
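If it's been a while since you reached for a set, a quick REPL check shows duplicates collapsing automatically:

>>> macs = set()
>>> for m in ['08:00:27:71:bc:15', '08:00:27:71:bc:15', 'ff:ff:ff:ff:ff:ff']:
...     macs.add(m)
...
>>> len(macs)  # the duplicate was only stored once
2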

Let's finish up the PCAP inspection part:

from __future__ import print_function

from scapy.all import rdpcap, Ether


def handler(event, context):  
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')

    macs = set()

    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        # Add them to the set of MAC addresses
        macs.add(src_mac)
        macs.add(dst_mac)

    print('Found {} MAC addresses'.format(len(macs)))

if __name__ == '__main__':  
    handler(event=None, context=None)

Running the script should show something similar to this:

$ python inspect_pcap_local.py 
Found 9 MAC addresses  

Now we need to query the external API for each MAC address. I'm a big fan of requests, but our use case is very simple, so we can rely on the built-in urllib2 module. We'll use the API provided by MAC Vendors, which can be queried by appending the MAC address to the end of the http://api.macvendors.com/ URL. Using urllib2, we can perform a lookup like this:

>>> import urllib2
>>> response = urllib2.urlopen('http://api.macvendors.com/08:00:27:71:bc:15')
>>> print(response.getcode())
200  
>>> print(response.readline())
PCS Systemtechnik GmbH  

The response code should be checked to make sure the lookup was successful: it will be 200 if the vendor was found. For a 404, urllib2 raises an HTTPError that we need to handle. All output that would normally go to stdout will be written to CloudWatch, and we don't want the logs cluttered with tracebacks when we know some queries will fail. After adding in the call to the external API, our function should look like this:

from __future__ import print_function

from scapy.all import rdpcap, Ether  
import urllib2


def handler(event, context):  
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')

    mac_addresses = set()

    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        # Add them to the set of MAC addresses
        mac_addresses.add(src_mac)
        mac_addresses.add(dst_mac)

    print('Found {} MAC addresses'.format(len(mac_addresses)))

    # Iterate over the set() of MAC addresses
    for mac in mac_addresses:
        # Attempt to look up the manufacturer
        try:
            resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(mac))
            if resp.getcode() == 200:
                vendor_str = resp.readline()
                print('{} is a {} network interface'.format(mac, vendor_str))
        # Handle not found queries
        except urllib2.HTTPError:
            print('The manufacturer for {} was not found'.format(mac))
            continue

if __name__ == '__main__':  
    handler(event=None, context=None)

Technical Debt Item 3 - make external API calls in parallel rather than serially
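
When we tackle item 3, one low-effort option in Python 2.7 is a thread pool from multiprocessing.dummy, since the work is I/O-bound. A sketch, not yet wired into the handler:

from multiprocessing.dummy import Pool  # thread-backed Pool, fine for I/O-bound work
import urllib2


def lookup_vendor(mac):
    # Returns (mac, vendor) or (mac, None) when the API has no record
    try:
        resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(mac))
        return mac, resp.readline()
    except urllib2.HTTPError:
        return mac, None

pool = Pool(4)  # four lookups in flight at a time; a real fix should also throttle
for mac, vendor in pool.map(lookup_vendor, ['08:00:27:71:bc:15', 'ff:ff:ff:ff:ff:ff']):
    print('{}: {}'.format(mac, vendor))

For now, though, the serial loop is fine.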

To list the manufacturers of the network interfaces represented in the PCAP, run the script:

$ python inspect_pcap_local.py 
Found 9 MAC addresses  
...
08:00:27:71:bc:15 is a PCS Systemtechnik GmbH network interface  
...
The manufacturer for 33:33:00:00:00:fb was not found  
The manufacturer for ff:ff:ff:ff:ff:ff was not found  
...

There's our VirtualBox host, but what about the ones that were not found? They're "special" MAC addresses that don't belong to a single host: 33:33:00:00:00:fb is an IPv6 multicast address (the Ethernet mapping of ff02::fb, used by multicast DNS) and ff:ff:ff:ff:ff:ff is the broadcast MAC address.
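
One way to spot these group addresses programmatically is the I/G bit: the low-order bit of a MAC's first octet is set for multicast and broadcast frames:

>>> def is_group(mac):
...     return bool(int(mac.split(':')[0], 16) & 1)  # I/G bit of the first octet
...
>>> is_group('33:33:00:00:00:fb')
True
>>> is_group('ff:ff:ff:ff:ff:ff')
True
>>> is_group('08:00:27:71:bc:15')
False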

As an aside, only the first 24 bits of the MAC address are needed to look up the manufacturer. These MAC addresses:

f8:bc:12:53:0b:da  
f8:bc:12:53:0b:db  

are unique and represent two separate network interfaces, but both were manufactured by Dell Inc. As a result, we only need to look up f8:bc:12 once.

Technical Debt Item 4 - lookup each OUI once
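
A sketch of what item 4 might look like: cache results keyed by OUI so each prefix hits the API once. This assumes the API accepts a bare OUI; if it doesn't, querying any full address with that prefix works the same way:

import urllib2

vendor_cache = {}  # OUI -> vendor name, or None when the API has no record


def vendor_for(mac):
    oui = mac[:8]  # first three octets, e.g. 'f8:bc:12'
    if oui not in vendor_cache:
        try:
            resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(oui))
            vendor_cache[oui] = resp.readline()
        except urllib2.HTTPError:
            vendor_cache[oui] = None
    return vendor_cache[oui]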

The next step is to load a PCAP file from S3. Python-based Lambda functions have the boto3 module available implicitly, but we'll include it explicitly so we can test locally. Downloading a file from S3 is very simple. The example below assumes that the user executing the code has properly configured ~/.aws/config and ~/.aws/credentials files.

>>> import boto3
>>> s3 = boto3.resource('s3')
>>> pcap_file = open('/tmp/temp.pcap', 'wb')
>>> s3.Object('uploaded-pcaps', 'wget_google.pcap').download_file(pcap_file.name)
>>> pcap_file.close()

Let's update our script to use a PCAP file from S3. This is where we start using Lambda's special event argument.

Technical Debt Item 5 - put an upper limit on the size of the PCAP being downloaded from S3
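
A sketch of how item 5 could work: check the object's size with a HEAD request before downloading. The 10 MB cap here is a made-up number:

import boto3

MAX_PCAP_BYTES = 10 * 1024 * 1024  # hypothetical cap

s3 = boto3.resource('s3')
obj = s3.Object('uploaded-pcaps', 'wget_google.pcap')
if obj.content_length > MAX_PCAP_BYTES:  # content_length triggers a HEAD request
    raise ValueError('PCAP too large: {} bytes'.format(obj.content_length))
obj.download_file('/tmp/temp.pcap')

With that noted, here's the updated function: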

from __future__ import print_function  
import json  
import os  
import urllib  
import urllib2

import boto3  
from scapy.all import rdpcap, Ether  


def handler(event, context):  
    # Log the event
    print('Received event: {}'.format(json.dumps(event)))
    # Extract the bucket and key (from AWS 's3-get-object-python' example)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        # Create a temporary file
        pcap_file = open('/tmp/temp.pcap', 'wb')

        # Download the PCAP from S3
        s3 = boto3.resource('s3')
        s3.Object(bucket, key).download_file(
            pcap_file.name)
        pcap_file.close()
    except Exception:
        print('Error getting object {} from the {} bucket'.format(key, bucket))
        # Re-raise so we don't continue on with a missing or partial file
        raise

    # Load PCAP file
    pcap = rdpcap(pcap_file.name)

    mac_addresses = set()

    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        # Add them to the set of MAC addresses
        mac_addresses.add(src_mac)
        mac_addresses.add(dst_mac)

    print('Found {} MAC addresses'.format(len(mac_addresses)))

    # Iterate over the set() of MAC addresses
    for mac in mac_addresses:
        # Attempt to look up the manufacturer
        try:
            resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(mac))
            if resp.getcode() == 200:
                vendor_str = resp.readline()
                print('{} is a {} network interface'.format(mac, vendor_str))
        # Handle not found queries
        except urllib2.HTTPError:
            print('The manufacturer for {} was not found'.format(mac))
            continue

    # Delete the temporary file
    os.remove(pcap_file.name)

if __name__ == '__main__':
    # For a local run, fake a minimal S3 event (same bucket/key as our test event)
    sample_event = {'Records': [{'s3': {
        'bucket': {'name': 'uploaded-pcaps'},
        'object': {'key': 'wget_google.pcap'}}}]}
    handler(event=sample_event, context=None)

We're now ready to prepare the archive for upload. It's tempting to just zip up the whole project directory, but that would include a lot of unnecessary data. Remember when we first set up the project directory and captured the output of pip freeze before and after installing scapy? Comparing the two files tells us which modules need to be included in our .zip archive.

$ diff pre_scapy.txt post_scapy.txt 
3a4  
> scapy==2.3.3

Easy enough; there's just one package we need to include in addition to our script. Lambda won't know about our virtual environment, so we can either set environment variables or "flatten" the .zip archive so the packages sit at its root. We'll do the latter. I'm sure there is a more elegant way to do this, but it'll do for now. We'll also want to exclude files that end in *.pyc:

$ cd env/lib/python2.7/site-packages/
$ zip -x "*.pyc" -r ../../../../inspect_pcap.zip scapy
$ cd ../../../../
$ zip -x "*.pyc" -r inspect_pcap.zip inspect_pcap.py

Technical Debt Item 6 - automate the creation of Lambda .ZIP archive
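
When we automate item 6, a small Python script using the standard zipfile module could replace the manual cd/zip dance; a sketch (build_zip.py is a made-up name):

import os
import zipfile


def build_archive(archive='inspect_pcap.zip',
                  site_packages='env/lib/python2.7/site-packages'):
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zf:
        # Bundle scapy at the top level of the archive, skipping *.pyc files
        pkg_root = os.path.join(site_packages, 'scapy')
        for root, _, files in os.walk(pkg_root):
            for name in files:
                if name.endswith('.pyc'):
                    continue
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, site_packages))
        # Add our handler module alongside it
        zf.write('inspect_pcap.py')

if __name__ == '__main__':
    build_archive()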

We're now ready to upload our Lambda function with the scapy module included. Following AWS's instructions, I created a Lambda function, an S3 trigger, and a role. The S3 trigger sends an event when an object is PUT to a specific bucket. A Lambda function has some permissions implicitly included (like writing logs to CloudWatch), but we need to explicitly grant it read-only access to S3 using a role. After everything is set up you should be able to see and edit your code in the console, which is very helpful because you don't have to keep re-uploading the .zip archive. If you're going to make a lot of changes in the UI, I recommend checking out Lambda's versioning. I configured the test event by clicking the "Actions" button, then "Configure test event". I started from the "S3 PUT" sample event template and modified it for our use:

{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventTime": "1970-01-01T00:00:00.000Z",
      "requestParameters": {
        "sourceIPAddress": "127.0.0.1"
      },
      "s3": {
        "configurationId": "testConfigRule",
        "object": {
          "eTag": "0123456789abcdef0123456789abcdef",
          "sequencer": "0A1B2C3D4E5F678901",
          "key": "wget_google.pcap",
          "size": 1024
        },
        "bucket": {
          "arn": "arn:aws:s3:::mybucket",
          "name": "uploaded-pcaps",
          "ownerIdentity": {
            "principalId": "EXAMPLE"
          }
        },
        "s3SchemaVersion": "1.0"
      },
      "responseElements": {
        "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH",
        "x-amz-request-id": "EXAMPLE123456789"
      },
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "EXAMPLE"
      },
      "eventSource": "aws:s3"
    }
  ]
}

Click the blue "Save and Test" button and in a few seconds you should see the function's output in the console. Once the function is put into production, you can check the logs in its CloudWatch log group.

We'll address our list of technical debt in the next post:

  1. write unit tests and configure CI
  2. add proper exception handling
  3. make external API calls in parallel rather than serially
  4. lookup each OUI once
  5. put an upper limit on the size of the PCAP being downloaded from S3
  6. automate the creation of Lambda .ZIP archive

Thanks for following along, please leave any comments or questions below.

