Validating File Content Types to avoid Malicious File Hosting using ML Model
This post discusses malicious file uploads through content-type spoofing and how we can mitigate such vulnerabilities using AI/ML models.
Most software, web apps, and Android applications need a file upload feature that lets users upload several file types, most commonly PDFs, docs, and images.
An attacker can exploit such a feature to upload malicious files by spoofing the content type, which can lead to XSS or to malicious payloads/malware being distributed from your servers/buckets.
There are several ways to tackle this problem: checking magic numbers, using libmagic, or using the file utility.
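To make the magic-number idea concrete, here is a minimal sketch that checks the first bytes of a file against a few well-known signatures. The `SIGNATURES` table and the `sniff_type` helper are illustrative names, not part of any library, and the table is far from exhaustive.

```python
from typing import Optional

# A few well-known file signatures (magic numbers). Illustrative, not exhaustive.
SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}


def sniff_type(data: bytes) -> Optional[str]:
    """Return the label whose signature prefixes `data`, or None if nothing matches."""
    for label, signature in SIGNATURES.items():
        if data.startswith(signature):
            return label
    return None
```

A check like this only looks at a prefix, so it says nothing about the rest of the file and cannot recognise text formats such as HTML or CSV, which have no fixed signature.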
Recently I stumbled upon a GitHub repo from a Google team that uses an AI/ML model to predict a file's content type. I wanted to give it a shot, which led to this weekend project. Its reported F1 score is far better than that of the other tools and libraries available. It uses a model stored in ONNX format, packaged inside the magika Python library (a package is also available for JavaScript, but enforcing a file validation check on the frontend makes no sense, as it can be easily bypassed).
So how can it help to secure file uploads?
Attackers usually bypass weak validation checks by uploading malicious HTML/XSS payloads with file names such as your-malicious-file-name.pdf, your-malicious-file-name.png, or your-malicious-file-name.jpg.
Magika can help us validate the file's actual content type instead of relying only on the name.
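The idea of comparing the claimed name with the detected content can be sketched as below. The detector is passed in as a plain callable so the logic stays independent of any particular tool. In this post's setup it would be backed by Magika, but `EXPECTED_LABEL`, `name_matches_content`, and the label strings are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Callable

# Hypothetical mapping from file extension to the content label we expect to detect.
EXPECTED_LABEL = {".pdf": "pdf", ".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg"}


def name_matches_content(
    file_name: str, data: bytes, detect: Callable[[bytes], str]
) -> bool:
    """True only if the extension is known and the detected label agrees with it."""
    extension = PurePosixPath(file_name).suffix.lower()
    expected = EXPECTED_LABEL.get(extension)
    if expected is None:
        # Unknown extensions are rejected rather than guessed at.
        return False
    return detect(data) == expected
```

With a detector that reports `html` for an HTML payload, a file named your-malicious-file-name.pdf fails this check even though its name looks harmless.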
The approach below is one way to address this vulnerability. Most companies generate a pre-signed POST request to give users a secure way to upload files directly to an S3 bucket, since this reduces the load on the backend server. So let's discuss a solution for validating the content type of objects uploaded to the bucket.
The frontend uses the pre-signed URL to upload the object to the bucket.
Once the file has been uploaded, an AWS S3 event generates a notification for AWS Lambda.
The Lambda function is triggered, gets the object from the bucket, and validates its content type against the allowed bucket objects policy. If the file's content type is invalid, the file is deleted from the bucket.
Code Block for achieving above flow:
Github Repo Link: https://github.com/dmdhrumilmistry/file-validation
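The actual code lives in the repository linked above. As a rough idea of the flow, a minimal sketch of such a handler could look like the following. It is not the repo's implementation: `BUCKET_POLICY`, `validate_uploads`, the bucket name, and the label strings are hypothetical. The attribute that holds Magika's predicted label has differed between releases, so the lookup in `detect` is an assumption to verify against the installed version.

```python
from typing import Callable, Dict, List, Set
from urllib.parse import unquote_plus

# Hypothetical policy: bucket name -> content labels allowed in that bucket.
BUCKET_POLICY: Dict[str, Set[str]] = {
    "example-upload-bucket": {"pdf", "png", "jpeg"},
}


def validate_uploads(
    event: dict,
    s3_client,
    detect: Callable[[bytes], str],
    policy: Dict[str, Set[str]] = BUCKET_POLICY,
) -> List[str]:
    """Delete uploaded objects whose detected label is not allowed; return their keys."""
    deleted = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Reads the whole object into memory; fine for small uploads only.
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        if detect(body) not in policy.get(bucket, set()):
            s3_client.delete_object(Bucket=bucket, Key=key)
            deleted.append(key)
    return deleted


def lambda_handler(event, context):
    # Imported lazily so the core logic above can be tested without AWS or Magika.
    import boto3
    from magika import Magika

    magika = Magika()

    def detect(data: bytes) -> str:
        output = magika.identify_bytes(data).output
        # Assumption: the label attribute is `label` or `ct_label`, depending on version.
        label = getattr(output, "label", None) or getattr(output, "ct_label")
        return str(label)

    return {"deleted": validate_uploads(event, boto3.client("s3"), detect)}
```

Keeping the detector and the S3 client as parameters of `validate_uploads` makes it possible to exercise the delete-on-mismatch logic locally with stand-ins before deploying anything.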
I love using containerized applications, so I'll deploy the above Lambda function as a container. You're free to use your own deployment strategy.
Let's write a Dockerfile for the Lambda function
We'll be using a private AWS ECR repo to store our Docker image, so let's create a private repository in AWS ECR.
Once we have the private ECR repo configured, build and push the container image to this repo.
First, log in to Docker
Build an amd64 image and push it to the ECR repo
Note: Initially I tried to use an arm64 Lambda function, but I ran into several issues running ONNX on it. After some research, it seems to be a PyTorch compatibility issue caused by missing sys directory files. So I chose an amd64-based container image for the Lambda function.
Now, let's create an IAM role policy for our Lambda function so it can access and delete S3 bucket objects and create CloudWatch logs.
Create the IAM policy using the JSON policy below
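As a rough guide to what such a policy contains, the sketch below builds a policy document in Python and prints it as JSON. The bucket name and the `build_policy` helper are placeholders, and the log resource is left broad for brevity, so tighten both for real use. The actions listed are the ones the description above calls for: reading and deleting objects, and writing CloudWatch logs.

```python
import json


def build_policy(bucket_name: str) -> dict:
    """Build an IAM policy document for the validation Lambda (illustrative only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadAndDeleteUploads",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                # Broad on purpose for the sketch; scope to your log group in practice.
                "Resource": "arn:aws:logs:*:*:*",
            },
        ],
    }


# "example-upload-bucket" is a placeholder bucket name.
print(json.dumps(build_policy("example-upload-bucket"), indent=2))
```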
Let's create a Lambda function for validating file content types.
Configure the ECR image, IAM role policy, memory, and timeout according to your needs.
Finally, let's create an event notification on the target S3 bucket, from the bucket's Properties tab, which will be used to trigger the Lambda function.
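If you prefer scripting this step over clicking through the console, the same notification can be expressed with boto3, as in the sketch below. The helper names and the choice of `s3:ObjectCreated:*` as the trigger are assumptions. Unlike the console, this call does not grant S3 permission to invoke the function, so that permission has to exist already.

```python
def build_notification_config(lambda_arn: str) -> dict:
    """Notification configuration that invokes the Lambda for every created object."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def attach_notification(bucket: str, lambda_arn: str) -> None:
    # Imported lazily so the config builder can be used without boto3 installed.
    import boto3

    # Note: this replaces any notification configuration already on the bucket.
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```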
Test the Lambda function by uploading files to the S3 bucket. After uploading, you should only see files whose content types are allowed by the bucket policy configured inside the file_validation.py file.
Here, test.png and sample.jpg are valid files, while the content type of test.csv is not allowed.
Validating file content types is a crucial part of developing secure software for any engineering team. AI/ML models can help teams validate file content types and avoid malicious uploads.
I find the malicious file upload meme above hilarious. WBU?