AWS Lambda is a service that lets you run code without provisioning servers. This opens up some interesting possibilities, especially for processing data asynchronously. When I first started learning about Lambda, most of the examples were about resizing images. I work with PCAP files on a daily basis and have used scapy for several years, so I thought it would be a good experiment to use Lambda to do some simple PCAP inspection.
It will work something like this:
- A PCAP file is uploaded to a specific S3 bucket.
- The bucket will trigger an event and notify the Lambda function.
- The Lambda function will load the PCAP file from S3 and extract MAC addresses.
- The manufacturer of the network interface will be looked up using an external API.
We'll be keeping track of the technical debt accumulated and address it after we have a working proof of concept. If you want to put a more positive spin on it, you can call it opportunities for improvement. Let's get started.
It's always a good idea to set up a virtual environment when starting a Python project, no matter how small. Let's create one, making sure to use Python 2.7, and install scapy:
$ mkdir inspect-pcap-lambda && cd inspect-pcap-lambda
$ virtualenv --python=python2.7 env
$ source env/bin/activate
$ pip freeze > pre_scapy.txt
$ pip install scapy
$ pip freeze > post_scapy.txt
You'll see why I captured the output of pip freeze both before and after installing scapy later in the post. scapy has a very useful REPL interface that can be loaded by simply running:
$ scapy
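For example, you can build packets layer by layer and inspect their fields right from the prompt (a quick illustrative session; I've left the output out since it will vary):
>>> pkt = Ether()/IP(dst='8.8.8.8')/UDP(dport=53)
>>> pkt.summary()   # one-line description of the layers
>>> ls(Ether)       # list the fields available on the Ether layer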
It's a great way to learn about the tool and I recommend exploring it further. We'll be writing a normal Python script, so scapy will be imported. To make things easier, we'll structure our script to match what is expected by AWS Lambda:
from __future__ import print_function

import scapy

def handler(event, context):
    pass
The handler function is the entry point for the Lambda function. It can be named anything you like, since you tell Lambda which module and function to import. We'll call this file inspect_pcap.py, so when our Lambda function runs, inspect_pcap.handler(event, context) will be called.
Let's start using scapy to inspect a PCAP file. There are lots of places where you can download PCAP files, but you can easily create your own. Make sure all other applications are closed to keep from cluttering up the capture, and start tcpdump in one terminal:
$ sudo tcpdump -i enp0s3 -s 1514 -w wget_google.pcap
In another terminal make a simple HTTP request to Google:
$ wget http://www.google.com
In the first terminal press Ctrl-C to stop the capture. It should output the number of packets captured. My capture had 29 packets and yours shouldn't have too many more than that. Take a look at your capture by reading it back:
$ tcpdump -r wget_google.pcap
reading from file wget_google.pcap, link-type EN10MB (Ethernet)
...
13:24:16.047269 IP 10.1.1.34.42683 > 10.1.1.1.domain: 63805+ A? www.google.com. (32)
13:24:16.066879 IP 10.1.1.1.domain > 10.1.1.34.42683: 63805 1/0/0 A 172.217.7.164 (48)
...
The packets shown represent a host at 10.1.1.34 performing a DNS lookup for www.google.com and a DNS server at 10.1.1.1 returning the IP address.
Ok, back to scapy. Let's add some code that opens the PCAP and prints a summary of each packet to the console. We need a way to actually invoke the handler function, so we'll add that as well:
from __future__ import print_function

from scapy.all import rdpcap

def handler(event, context):
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')
    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        print(pkt.summary())

if __name__ == '__main__':
    handler(event=None, context=None)
Next run the script:
$ python inspect_pcap_local.py
...
Ether / IP / UDP / DNS Qry "www.google.com."
Ether / IP / UDP / DNS Ans "172.217.7.164"
...
Technical Debt Item 1 - write unit tests and configure CI
Ether, IP and UDP represent the layers of the packets. See Internet protocol suite for more information about what each of these layers does and how they work together. We want to take a closer look at the Ether layer for now, specifically the source and destination MAC addresses. Let's look up the manufacturer of each MAC address based on the first 24 bits, called the Organizationally Unique Identifier or OUI. Manufacturers of network interfaces are assigned one or more OUIs by the IEEE, and each OUI is universally unique. This means that every network interface has a completely unique MAC address. This is a simplification, but for our purposes it'll work. The MAC address of the (VirtualBox) host at 10.1.1.34 is 08:00:27:71:bc:15. The OUI is the first 24 bits, or 08:00:27, which according to the Wireshark - OUI Lookup page is assigned to "PCS Systemtechnik GmbH".
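To make "the first 24 bits" concrete, the OUI is just the first three octets of the colon-separated string (a quick interpreter check):
>>> mac = '08:00:27:71:bc:15'
>>> ':'.join(mac.split(':')[:3])
'08:00:27'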
To extract the MAC addresses from each packet, we need to look at the Ether layer.
from __future__ import print_function

from scapy.all import rdpcap, Ether

def handler(event, context):
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')
    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        print('src_mac = {} dst_mac = {}'.format(src_mac, dst_mac))

if __name__ == '__main__':
    handler(event=None, context=None)
This is not robust code. If the packet doesn't have an Ether layer, this will raise an unhandled exception.
Technical Debt Item 2 - add proper exception handling
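Until that item is addressed, a minimal guard might look something like the sketch below, which uses scapy's haslayer() to skip packets without an Ethernet layer (this isn't in our script yet):
from scapy.all import rdpcap, Ether

pcap = rdpcap('wget_google.pcap')
for pkt in pcap:
    # Skip packets that don't carry an Ethernet layer instead of crashing on them
    if not pkt.haslayer(Ether):
        continue
    print('{} -> {}'.format(pkt.getlayer(Ether).src, pkt.getlayer(Ether).dst))
For now, though, let's run the unguarded version: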
$ python inspect_pcap_local.py
src_mac = 74:d0:2b:93:fb:12 dst_mac = ff:ff:ff:ff:ff:ff
...
src_mac = 08:00:27:71:bc:15 dst_mac = 2c:30:33:e9:3c:a3
...
We can see the source and destination MAC addresses of each packet in the PCAP. Next, we need to put these values into a set so we can query the external API. A set is a better choice than a list because many values will be repeated and we don't want to query the external API any more than necessary.
Sending a few extra queries may not seem like a big deal, but we are using a public cloud where everything has a cost. Also, we want to be good Internet citizens and not unduly burden a free API.
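To illustrate the deduplication a set gives us (a throwaway interpreter session):
>>> macs = set()
>>> macs.add('08:00:27:71:bc:15')
>>> macs.add('08:00:27:71:bc:15')   # adding a duplicate is a no-op
>>> len(macs)
1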
Let's finish up the PCAP inspection part:
from __future__ import print_function

from scapy.all import rdpcap, Ether

def handler(event, context):
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')
    macs = set()
    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        # Add them to the set of MAC addresses
        macs.add(src_mac)
        macs.add(dst_mac)
    print('Found {} MAC addresses'.format(len(macs)))

if __name__ == '__main__':
    handler(event=None, context=None)
Running the script should show something similar to this:
$ python inspect_pcap_local.py
Found 9 MAC addresses
Now we need to query the external API for each MAC address. I'm a big fan of requests, but our use case is very simple so we can rely on the built-in urllib2 module. We'll use the API provided by MAC Vendors, which can be queried by appending the MAC address to the end of the http://api.macvendors.com/ URL. Using urllib2, we can perform a lookup like this:
>>> import urllib2
>>> response = urllib2.urlopen('http://api.macvendors.com/08:00:27:71:bc:15')
>>> print(response.getcode())
200
>>> print(response.readline())
PCS Systemtechnik GmbH
The response code should be checked to make sure the lookup was successful. It will be 200 if the manufacturer was found; if it was not found, the API returns a 404 and urllib2 raises an HTTPError that should be handled. All output that would normally go to stdout will be written to CloudWatch, and we don't want the logs getting cluttered up with tracebacks when we know that some queries will fail. After adding in the call to the external API, our function should look like this:
from __future__ import print_function

from scapy.all import rdpcap, Ether
import urllib2

def handler(event, context):
    # Load PCAP file
    pcap = rdpcap('wget_google.pcap')
    mac_addresses = set()
    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        # Add them to the set of MAC addresses
        mac_addresses.add(src_mac)
        mac_addresses.add(dst_mac)
    print('Found {} MAC addresses'.format(len(mac_addresses)))
    # Iterate over the set() of MAC addresses
    for mac in mac_addresses:
        # Attempt to look up the manufacturer
        try:
            resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(mac))
            if resp.getcode() == 200:
                vendor_str = resp.readline()
                print('{} is a {} network interface'.format(mac, vendor_str))
        # Handle not found queries
        except urllib2.HTTPError:
            print('The manufacturer for {} was not found'.format(mac))
            continue

if __name__ == '__main__':
    handler(event=None, context=None)
Technical Debt Item 3 - make external API calls in parallel rather than serially
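We'll tackle that in the follow-up post, but one possible approach is a small thread pool; the sketch below uses multiprocessing.dummy (a thread-backed Pool available in Python 2.7) and assumes mac_addresses is the set built above:
import urllib2
from multiprocessing.dummy import Pool  # thread pool, fine for I/O-bound work

mac_addresses = {'08:00:27:71:bc:15', 'ff:ff:ff:ff:ff:ff'}  # stand-in for the set built above

def lookup(mac):
    try:
        return mac, urllib2.urlopen('http://api.macvendors.com/{}'.format(mac)).readline()
    except urllib2.HTTPError:
        return mac, None

pool = Pool(4)
results = pool.map(lookup, list(mac_addresses))
pool.close()
pool.join()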
To list the manufacturers of the network interfaces represented in the PCAP, run the script:
$ python inspect_pcap_local.py
Found 9 MAC addresses
...
08:00:27:71:bc:15 is a PCS Systemtechnik GmbH network interface
...
The manufacturer for 33:33:00:00:00:fb was not found
The manufacturer for ff:ff:ff:ff:ff:ff was not found
...
There's our VirtualBox host, but what about the ones that were not found? They're actually "special" MAC addresses that don't belong to a single host. 33:33:00:00:00:fb is an IPv6 multicast MAC address (here, multicast DNS) and ff:ff:ff:ff:ff:ff is the broadcast MAC address.
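If we wanted to skip those special addresses, the multicast/broadcast bit is the least-significant bit of the first octet; a small helper (hypothetical, not part of our script) could check it:
def is_unicast(mac):
    # Broadcast and multicast MACs have the low-order bit of the first octet set
    return int(mac.split(':')[0], 16) & 0x01 == 0

print(is_unicast('ff:ff:ff:ff:ff:ff'))  # False
print(is_unicast('08:00:27:71:bc:15'))  # True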
As an aside, only the first 24 bits of the MAC address are needed to look up the manufacturer. These MAC addresses:
f8:bc:12:53:0b:da
f8:bc:12:53:0b:db
are unique and represent two separate network interfaces, but they are both manufactured by Dell Inc. As a result we only need to look up f8:bc:12 once.
Technical Debt Item 4 - lookup each OUI once
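That one is also deferred, but the basic idea is a dictionary keyed by OUI so each prefix is queried at most once (a sketch; vendor_cache and lookup_vendor are made-up names, and I'm assuming the API accepts a bare OUI prefix):
import urllib2

vendor_cache = {}

def lookup_vendor(mac):
    # Cache results by OUI so repeated MACs from the same vendor cost one query
    oui = ':'.join(mac.split(':')[:3]).lower()
    if oui not in vendor_cache:
        resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(oui))
        vendor_cache[oui] = resp.readline()
    return vendor_cache[oui]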
The next step is to load a PCAP file from S3. Python-based Lambda functions have the boto3 module available implicitly, but we'll include it explicitly so we can test locally. Downloading a file from S3 is very simple. The example below assumes that the user executing the code has properly configured ~/.aws/config and ~/.aws/credentials files.
>>> import boto3
>>> s3 = boto3.resource('s3')
>>> pcap_file = open('/tmp/temp.pcap', 'wb')
>>> s3.Object('uploaded-pcaps', 'wget_google.pcap').download_file(pcap_file.name)
>>> pcap_file.close()
Let's update our script to use a PCAP file from S3. This is when we start using the special event Lambda function argument.
Technical Debt Item 5 - put an upper limit on the size of the PCAP being downloaded from S3
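That item is also on the follow-up list, but the gist would be checking the object's size before downloading it; a sketch, where MAX_PCAP_BYTES is an arbitrary threshold and content_length is populated via a HEAD request rather than a download:
import boto3

MAX_PCAP_BYTES = 50 * 1024 * 1024  # arbitrary 50 MB cap

def small_enough(bucket, key):
    # Ask S3 for the object's metadata only; nothing is downloaded here
    obj = boto3.resource('s3').Object(bucket, key)
    return obj.content_length <= MAX_PCAP_BYTES
With that noted, here's the updated script: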
from __future__ import print_function

import json
import os
import urllib
import urllib2

import boto3
from scapy.all import rdpcap, Ether

def handler(event, context):
    # Log the event
    print('Received event: {}'.format(json.dumps(event)))
    # Extract the bucket and key (from AWS 's3-get-object-python' example)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        # Create a temporary file
        pcap_file = open('/tmp/temp.pcap', 'wb')
        # Download the PCAP from S3
        s3 = boto3.resource('s3')
        s3.Object(bucket, key).download_file(pcap_file.name)
        pcap_file.close()
    except Exception:
        print('Error getting object {} from the {} bucket'.format(key, bucket))
        # Re-raise so we don't try to parse a file that was never downloaded
        raise
    # Load PCAP file
    pcap = rdpcap(pcap_file.name)
    mac_addresses = set()
    # Iterate over each packet in the PCAP file
    for pkt in pcap:
        # Get the source and destination MAC addresses
        src_mac = pkt.getlayer(Ether).src
        dst_mac = pkt.getlayer(Ether).dst
        # Add them to the set of MAC addresses
        mac_addresses.add(src_mac)
        mac_addresses.add(dst_mac)
    print('Found {} MAC addresses'.format(len(mac_addresses)))
    # Iterate over the set() of MAC addresses
    for mac in mac_addresses:
        # Attempt to look up the manufacturer
        try:
            resp = urllib2.urlopen('http://api.macvendors.com/{}'.format(mac))
            if resp.getcode() == 200:
                vendor_str = resp.readline()
                print('{} is a {} network interface'.format(mac, vendor_str))
        # Handle not found queries
        except urllib2.HTTPError:
            print('The manufacturer for {} was not found'.format(mac))
            continue
    # Delete the temporary file
    os.remove(pcap_file.name)

if __name__ == '__main__':
    # Running this file directly now requires a sample S3 event (see below)
    handler(event=None, context=None)
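To keep exercising this version locally, the handler now needs a fake S3 event; here's a minimal sketch (local_test.py is a made-up name, the bucket and key should match something you've actually uploaded, and your ~/.aws credentials must allow the read):
# local_test.py -- a throwaway helper, not part of the deployed archive
from inspect_pcap import handler

fake_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'uploaded-pcaps'},
            'object': {'key': 'wget_google.pcap'},
        }
    }]
}

handler(event=fake_event, context=None)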
We're now ready to prepare the archive for upload. It's tempting to just zip up the whole project directory, but that would include a lot of unnecessary data. Remember how, when we first set up the project directory, we captured the output of pip freeze before and after installing scapy? Comparing these two files tells us which modules need to be included in our .zip archive.
$ diff pre_scapy.txt post_scapy.txt
3a4
> scapy==2.3.3
Easy enough, there's just one package we need to include in addition to our script. Lambda won't know about our virtual environment, so we can either set environment variables or "flatten" the .zip archive. I'm sure there is a more elegant way to do this, but it'll do for now. We'll also want to exclude files that end in *.pyc:
$ cd env/lib/python2.7/site-packages/
$ zip -x "*.pyc" -r ../../../../inspect_pcap.zip scapy
$ cd ../../../../
$ zip -x "*.pyc" -r inspect_pcap.zip inspect_pcap.py
Technical Debt Item 6 - automate the creation of Lambda .ZIP archive
We're now ready to upload our Lambda function with the scapy module included. Following the instructions here, I created a Lambda function, an S3 trigger and a role. The S3 trigger will send an event when an object is PUT to a specific bucket. A Lambda function has some permissions implicitly included (like writing logs to CloudWatch), but we need to explicitly grant it read-only permissions to S3 using a role. After everything is set up you should be able to see and edit your code in the console, which is very helpful so you don't have to keep re-uploading the .zip archive. If you're going to be making a lot of changes in the UI, I recommend checking out versions. I configured the test event by clicking the "Actions" button, then "Configure test event". The sample event template used was "S3 PUT", and I modified it for our use:
{
    "Records": [
        {
            "eventVersion": "2.0",
            "eventTime": "1970-01-01T00:00:00.000Z",
            "requestParameters": {
                "sourceIPAddress": "127.0.0.1"
            },
            "s3": {
                "configurationId": "testConfigRule",
                "object": {
                    "eTag": "0123456789abcdef0123456789abcdef",
                    "sequencer": "0A1B2C3D4E5F678901",
                    "key": "wget_google.pcap",
                    "size": 1024
                },
                "bucket": {
                    "arn": "arn:aws:s3:::mybucket",
                    "name": "uploaded-pcaps",
                    "ownerIdentity": {
                        "principalId": "EXAMPLE"
                    }
                },
                "s3SchemaVersion": "1.0"
            },
            "responseElements": {
                "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH",
                "x-amz-request-id": "EXAMPLE123456789"
            },
            "awsRegion": "us-east-1",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {
                "principalId": "EXAMPLE"
            },
            "eventSource": "aws:s3"
        }
    ]
}
Click the blue "Save and Test" button and in a few seconds you should see a successful execution result. To check the logs once the function is put into production, you'll look at its CloudWatch log group.
We'll address our list of technical debt in the next post:
- write unit tests and configure CI
- add proper exception handling
- make external API calls in parallel rather than serially
- lookup each OUI once
- put an upper limit on the size of the PCAP being downloaded from S3
- automate the creation of Lambda .ZIP archive
Thanks for following along, please leave any comments or questions below.