AWS Fraud Detector
Amazon Fraud Detector is a fully managed service that helps customers identify potentially fraudulent online activities, such as the creation of fake accounts and online payment fraud. It leverages machine learning (ML), historical user data, and Amazon's years of fraud detection expertise to train and build a model that automatically identifies these threats. Announced at re:Invent 2019, it is now generally available.
You do not need machine learning experience to use this service. From a high level, the concept is to upload data to train, test, and deploy a fraud-detection model. Once the model is deployed, it is exposed as an endpoint that generates fraud predictions for your application to consume.
The inputs to the model are elements such as IP address and billing address. The service combines those with the fraud patterns seen at Amazon as input features to your model, using a gradient tree boosting algorithm. The resulting model is a supervised machine learning model, which requires the historical data to be labeled as fraudulent or legitimate. The trained model will be able to detect a variety of online fraud, including:
- New account fraud
- Online payment fraud
- Guest checkout fraud
- Fake reviews abuse

Each prediction request to the service returns a score, and you can set up subsequent action(s) based on that score, for example having the customer complete an additional verification step.
You can access the service via the management console or through the API. We will use the API for this post, taking a close look at:

- The overall workflow
- A deep dive into some of the features and limitations of the service
- Security practices
- Pricing

Setting it up is quite easy. The workflow of building a model and generating fraud predictions includes:
1. Gather historical event data (e.g. customer purchase history) from your application/platform
2. Upload the historical event data to AWS S3
3. Define the event that you want to evaluate for fraud
4. Create decision rules (conditions for Amazon Fraud Detector to interpret input variable values and determine an outcome during a fraud prediction)
5. Create and train a model
6. Validate the model using a subset of the uploaded data
7. Once you are happy with the model, set the model version status to ACTIVE
8. Create a detector (which contains the detection logic to evaluate an event), with the model, model version, and decision rules created earlier as the inputs
9. Start sending events to Amazon Fraud Detector and get fraud prediction results

Let's test it out.

Step 1. Upload testing data

In this example, we will train a model using IP address and email address as training data. Once the model is trained, we will send in some arbitrary IP addresses and email addresses to get a prediction of whether each request is likely fraudulent or legitimate.
We need to upload the dataset file to an AWS S3 bucket so that it can be consumed by the service.
In terms of the dataset file, a couple of things to watch out for:

- The dataset file must be in CSV format (Fraud Detector only accepts CSV files)
- Each CSV file must contain two columns: EVENT_TIMESTAMP and EVENT_LABEL

As a result, the dataset file that I used for this example has 4 columns:

- IP_Address (e.g. 10.0.0.1)
- Email_Address (e.g. testingBen@example.com)
- EVENT_TIMESTAMP in ISO 8601 format (e.g. 2020-07-30T18:00:00Z)
- EVENT_LABEL, where each record contains the word "Fraud" or "Legit"

Note: the S3 bucket must be in an AWS region where Amazon Fraud Detector is currently available.
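To illustrate the expected shape of the file, here is a short sketch that writes a small sample dataset in that 4-column layout (the rows and values are fabricated for illustration):

```python
import csv
import io

# Fabricated sample rows matching the 4-column layout described above.
rows = [
    {"ip_address": "10.0.0.1", "email_address": "testingBen@example.com",
     "EVENT_TIMESTAMP": "2020-07-30T18:00:00Z", "EVENT_LABEL": "legit"},
    {"ip_address": "10.0.0.2", "email_address": "suspicious@example.com",
     "EVENT_TIMESTAMP": "2020-07-30T18:05:00Z", "EVENT_LABEL": "fraud"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["ip_address", "email_address", "EVENT_TIMESTAMP", "EVENT_LABEL"],
)
writer.writeheader()
writer.writerows(rows)
dataset_csv = buffer.getvalue()
print(dataset_csv)
```

The resulting string can then be written to a file and uploaded to your S3 bucket, for example with boto3's `s3.upload_file`.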
Step 2. Setting up the Python environment

I created a Cloud9 environment and will be using the AWS SDK for Python (Boto3) to test out the service. If you do not want to go through installing Python and Boto3 on your local machine, I would suggest spinning up a Cloud9 environment, which already has Python and Boto3 installed. It literally took me 5 minutes to get started and make my first API call.
The size of the Cloud9 instance does not matter because we are not running any training jobs on it; we are only sending API calls to the Fraud Detector service. I used the smallest instance size for my example.
I created a testing.py script with the code listed below. It makes a call to the service to list the detectors that I have created. Of course, it will return an empty list as I have not created anything yet, but this is just a test to make sure that I am able to use the SDK to call the service.
import boto3

fraudDetector = boto3.client('frauddetector')

response = fraudDetector.get_detectors()
print(response)
I executed the code in the Cloud9 terminal. It worked! I got a response with an empty list of detectors ('detectors': []) and no error. Let's continue.
ec2-user:~/environment/testpy $ python testing.py
{'detectors': [], 'ResponseMetadata': {'RequestId': 'response-id', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Thu, 30 Jul 2020 08:21:15 GMT', 'x-amzn-requestid': 'requestid', 'content-length': '16', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}
Step 3. Start creating elements: variables, entity, and label

Now that we have the environment set up, we can start making API calls. We need to create a couple of elements that will become the inputs to subsequent calls.
- Variable - represents a data element that you want to use in a fraud prediction. There are a number of supported variable types; for the full list, see the Amazon Fraud Detector documentation. In this example, the variables are the IP addresses and email addresses.
- Entity - represents who is performing the event, for example a merchant, customer, or account.
- Label - classifies an event as fraudulent or legitimate.
import boto3

fraudDetector = boto3.client('frauddetector')

# Create variable email_address
fraudDetector.create_variable(
    name='email_address',
    variableType='EMAIL_ADDRESS',
    dataSource='EVENT',
    dataType='STRING',
    defaultValue=''
)

# Create variable ip_address
fraudDetector.create_variable(
    name='ip_address',
    variableType='IP_ADDRESS',
    dataSource='EVENT',
    dataType='STRING',
    defaultValue=''
)

# Create entity type
fraudDetector.put_entity_type(
    name='sample_customer',
    description='testing customer entity type'
)

# Create labels
fraudDetector.put_label(
    name='fraud',
    description='label for fraud events'
)

fraudDetector.put_label(
    name='legit',
    description='label for legitimate events'
)
Step 4. Create the event type

Moving on to creating an event type, which defines the structure of an individual event. For this step, we use the variables, labels, and entity type that we created earlier as the inputs. Once the event type is defined, you can build models and detectors that evaluate the risk for that specific event type.
fraudDetector.put_event_type(
    name='testing_registration',
    eventVariables=['ip_address', 'email_address'],
    labels=['legit', 'fraud'],
    entityTypes=['sample_customer']
)
Step 5. Create a model

We will now create and train a model. From the SDK perspective, we need to call:

- create_model - creates a container for your model versions
- create_model_version - where you specify the training data source, the variables contained in the dataset, and the mapping of labels to fraud or legit
- update_model_version_status - changes the status of the model version (ACTIVE or INACTIVE), which lets the service know which model to use when a prediction request comes in

fraudDetector.create_model(
    modelId='testing_fraud_detection_model',
    eventTypeName='testing_registration',
    modelType='ONLINE_FRAUD_INSIGHTS'
)

fraudDetector.create_model_version(
    modelId='testing_fraud_detection_model',
    modelType='ONLINE_FRAUD_INSIGHTS',
    trainingDataSource='EXTERNAL_EVENTS',
    trainingDataSchema={
        'modelVariables': ['ip_address', 'email_address'],
        'labelSchema': {
            'labelMapper': {
                'FRAUD': ['fraud'],
                'LEGIT': ['legit']
            }
        }
    },
    externalEventsDetail={
        'dataLocation': 's3://yours3bucket/datasetfileuploaded.csv',
        'dataAccessRoleArn': 'role_arn'
    }
)
fraudDetector.update_model_version_status(
    modelId='testing_fraud_detection_model',
    modelType='ONLINE_FRAUD_INSIGHTS',
    modelVersionNumber='1.00',
    status='ACTIVE'
)

Training can take a long time, depending on the amount of data that you have. Once the training is completed, you can review the model scores and performance using DescribeModelVersions. Based on the score returned, and other values such as precision, threshold, and AUC, you can then decide how to segment the prediction score, which is a value between 0 and 1000 (for example: if the score is greater than 900, treat it as a high-risk request).
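Since training is asynchronous, a simple way to wait for it is to poll the model version status. Here is a minimal sketch; the polling interval and the `fetch_status` callable are my own choices, and with boto3 the fetcher would wrap `get_model_version` and read its status field:

```python
import time

def wait_for_training(fetch_status, poll_seconds=60, max_polls=120):
    """Poll fetch_status() until training finishes; return the final status.

    fetch_status is any zero-argument callable returning the model version
    status string, e.g. 'TRAINING_IN_PROGRESS' or 'TRAINING_COMPLETE'.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status != 'TRAINING_IN_PROGRESS':
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('model training did not finish in time')

# Example with a stubbed status sequence instead of a live AWS call.
statuses = iter(['TRAINING_IN_PROGRESS', 'TRAINING_IN_PROGRESS', 'TRAINING_COMPLETE'])
final = wait_for_training(lambda: next(statuses), poll_seconds=0)
print(final)  # TRAINING_COMPLETE
```

In practice you could run this from the same Cloud9 terminal, or skip it entirely and check the status in the console.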
Step 6. We got the model, now what?

From a high level, this is the detector creation step, and it consists of: create detector -> create outcomes -> create rules -> create detector version using the rules and model as input -> update the detector version to ACTIVE.

Walking through the detector and rules in this order makes the dependencies between the components clearer:
- Detector - acts as a container for your detector versions
- Outcome - the result of a fraud prediction. You can define outcomes as risk levels or as actions to be taken, for example (high, medium, low risk) or (verify, approve, review)
- Rule - a condition, expressed as a logic expression, for Amazon Fraud Detector to interpret input variable values and determine an outcome during a fraud prediction. A rule requires an outcome to exist, which is why we create the outcomes before this step
- Detector version - defines the specific models and rules that will be run as part of a get prediction request
- Detector status - similar to the model status; once set to ACTIVE, this version is the default detector used if no detector version is specified in the prediction request
fraudDetector.put_detector(
    detectorId='testing_detector',
    eventTypeName='testing_registration'
)

fraudDetector.put_outcome(
    name='need_verification',
    description='this outcome initiates a verification workflow'
)

fraudDetector.put_outcome(
    name='staff_review',
    description='this outcome sidelines event for review'
)

fraudDetector.put_outcome(
    name='approved',
    description='this outcome approves the event'
)

fraudDetector.create_rule(
    ruleId='high_fraud_risk',
    detectorId='testing_detector',
    expression='$testing_fraud_detection_model_insightscore > 900',
    language='DETECTORPL',
    outcomes=['need_verification']
)

fraudDetector.create_rule(
    ruleId='medium_fraud_risk',
    detectorId='testing_detector',
    expression='$testing_fraud_detection_model_insightscore <= 900 and $testing_fraud_detection_model_insightscore > 700',
    language='DETECTORPL',
    outcomes=['staff_review']
)

fraudDetector.create_rule(
    ruleId='low_fraud_risk',
    detectorId='testing_detector',
    expression='$testing_fraud_detection_model_insightscore <= 700',
    language='DETECTORPL',
    outcomes=['approved']
)
fraudDetector.create_detector_version(
    detectorId='testing_detector',
    rules=[
        {'detectorId': 'testing_detector', 'ruleId': 'high_fraud_risk', 'ruleVersion': '1'},
        {'detectorId': 'testing_detector', 'ruleId': 'medium_fraud_risk', 'ruleVersion': '1'},
        {'detectorId': 'testing_detector', 'ruleId': 'low_fraud_risk', 'ruleVersion': '1'}
    ],
    modelVersions=[{
        'modelId': 'testing_fraud_detection_model',
        'modelType': 'ONLINE_FRAUD_INSIGHTS',
        'modelVersionNumber': '1.00'
    }],
    ruleExecutionMode='FIRST_MATCHED'
)

fraudDetector.update_detector_version_status(
    detectorId='testing_detector',
    detectorVersionId='1',
    status='ACTIVE'
)

Step 7. Make a prediction

Now that everything is set up, let's try to make a prediction! In a production environment, I would imagine the workflow being something like this:
1. The customer presses "Place Order"
2. Your application makes a call to the endpoint / API
3. The service returns the result with one of the outcomes that we defined earlier
4. Your application reacts to the returned result. For example, if the returned result is "need_verification", your application may prompt the customer with additional verification forms, or hold the transaction until an agent approves the order

The inputs to our get prediction call are:
- DetectorId
- EventId
- EventTypeName
- EventTimestamp
- Entities
- And most importantly, the eventVariables (the data that we are trying to classify as fraudulent or not)

fraudDetector.get_event_prediction(
    detectorId='testing_detector',
    eventId='your event id',
    eventTypeName='testing_registration',
    eventTimestamp='current timestamp utc ISO8601 format',
    entities=[{'entityType': 'sample_customer', 'entityId': '12345'}],
    eventVariables={
        'email_address': 'johndoe@example.com',
        'ip_address': '9.8.7.6'
    }
)

Something to consider: SECURITY! Amazon Fraud Detector, like other AWS managed services, conforms to the AWS shared responsibility model. AWS takes care of the security of the infrastructure hosting the service; the user is responsible for any personal data that they put in the AWS Cloud.
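Returning to the prediction call for a moment: to make the application-side handling concrete, here is a minimal sketch of dispatching on the outcome returned by get_event_prediction. The response parsing assumes the documented ruleResults shape, and the returned action names are placeholders:

```python
def handle_prediction(response):
    """Map the matched outcome(s) to an application action (placeholder logic)."""
    # With ruleExecutionMode='FIRST_MATCHED', only the first matched rule's
    # outcomes appear in ruleResults.
    outcomes = []
    for rule_result in response.get('ruleResults', []):
        outcomes.extend(rule_result.get('outcomes', []))

    if 'need_verification' in outcomes:
        return 'prompt_additional_verification'
    if 'staff_review' in outcomes:
        return 'hold_for_agent_review'
    return 'approve_order'

# Example with a stubbed response instead of a live call.
stub_response = {'ruleResults': [{'ruleId': 'high_fraud_risk',
                                  'outcomes': ['need_verification']}]}
print(handle_prediction(stub_response))  # prompt_additional_verification
```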
A couple of things are worth calling out:

- Use multi-factor authentication (MFA) with each account
- Ensure data-in-transit encryption (TLS/SSL)
- Ensure data-at-rest encryption (symmetric and asymmetric keys using AWS KMS)
- Turn on activity logging with CloudTrail to monitor every action (API operation)
- Use CloudWatch metrics to create alarms, on metrics such as the number of predictions performed and prediction latency
- Mask your data, or avoid putting identifying information such as names into the dataset
- Use VPC endpoints (AWS PrivateLink): create an interface VPC endpoint for Amazon Fraud Detector, which establishes a private connection between your VPC and the service so that your requests do not traverse the internet
- Use IAM roles or groups to access the service
- If you do not want your data to be used for training purposes, you can opt out of having your data used for service improvement
How much is it?

This service offers a free trial for the first two months after sign-up, during which you are allotted:

- 50 compute hours of free model training
- 500 compute hours of model hosting
- 30,000 real-time Online Fraud Insights predictions and 30,000 real-time rules-based fraud predictions per month

By default, training uses an instance with 8 vCPUs and 32 GiB of memory. However, the service will choose the most efficient instance type to train on your data, so you may end up training a model in 1 hour but being billed for 2 hours because an instance with higher specs was used to perform the training.
After the free tier, the pricing is:

- Model training: $0.39 per compute hour
- Model hosting: $0.06 per compute hour
- Real-time Online Fraud Insights predictions: $0.030 per prediction for the first 400,000 predictions per month, $0.015 per prediction for the next 800,000 predictions per month, and $0.0075 per prediction beyond 1,200,000 predictions per month

An example taken from Amazon's pricing page:
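The tiered prediction pricing can be sketched as a small helper; the tier boundaries and rates are the ones quoted above, and this is an illustration, not a billing tool:

```python
def prediction_cost(n_predictions):
    """Monthly cost (USD) of real-time Online Fraud Insights predictions."""
    first = min(n_predictions, 400_000)                     # $0.030 each
    middle = min(max(n_predictions - 400_000, 0), 800_000)  # $0.015 each
    rest = max(n_predictions - 1_200_000, 0)                # $0.0075 each
    return first * 0.030 + middle * 0.015 + rest * 0.0075

print(round(prediction_cost(30_000), 2))     # 900.0
print(round(prediction_cost(1_500_000), 2))  # 26250.0
```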
Example 1: real-time online fraud detection for an eCommerce merchant, with 10 compute hours of training, 1 model hosted, and 1,000 predictions per day.

The bill for the month for using Amazon Fraud Detector will be:
Training charge = 10 compute hours x 2 trainings x $0.39 per compute hour = $7.80
Hosting charge = 30 days x 24 hours x 1 model x $0.06 per compute hour = $43.20
Fraud prediction charge (real-time) = 1,000 predictions / day x 30 days x $0.03 per Online Fraud Insights prediction = $900
Total cost = $7.80 + $43.20 + $900 = $951
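The arithmetic above is easy to verify with a short script:

```python
# Example 1 from the pricing breakdown above.
training = 10 * 2 * 0.39         # 10 compute hours x 2 trainings x $0.39
hosting = 30 * 24 * 1 * 0.06     # 30 days x 24 hours x 1 model x $0.06
predictions = 1_000 * 30 * 0.03  # 1,000 predictions/day x 30 days x $0.03

total = training + hosting + predictions
print(round(training, 2), round(hosting, 2), round(predictions, 2))  # 7.8 43.2 900.0
print(round(total, 2))  # 951.0
```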
Wrap up

AWS Fraud Detector makes it easy for companies that do not have a fraud detection system in place to get started and set up their own. I was able to set up my testing prediction model with about 160 lines of code. So yes, definitely give it a try, either at the API level (Cloud9 or a Jupyter notebook) or from the management console.
Setting up a proper pipeline to retrain your model in production will bring great benefits. As your application/platform continues to perform more transactions, more data is collected, which can help train a model that gives better predictions going forward. You can set up a periodic (hourly, daily, weekly, monthly, or yearly) task to perform the training of the model. Each model version has its own status, which makes the transition of your production model a seamless task.
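A retraining task along those lines might look like the sketch below. It takes the boto3 client as a parameter so it can be scheduled (e.g. from a Lambda or cron job); the model id, S3 path, and role ARN are placeholders carried over from the example above:

```python
def retrain_model(client, model_id, data_location, role_arn):
    """Kick off a new model version trained on the latest dataset."""
    response = client.create_model_version(
        modelId=model_id,
        modelType='ONLINE_FRAUD_INSIGHTS',
        trainingDataSource='EXTERNAL_EVENTS',
        trainingDataSchema={
            'modelVariables': ['ip_address', 'email_address'],
            'labelSchema': {'labelMapper': {'FRAUD': ['fraud'], 'LEGIT': ['legit']}}
        },
        externalEventsDetail={
            'dataLocation': data_location,
            'dataAccessRoleArn': role_arn,
        },
    )
    # Once training completes and the metrics look good, flip the new
    # version to ACTIVE (update_model_version_status) so detectors use it.
    return response

# Example with a fake client standing in for boto3.client('frauddetector').
class FakeClient:
    def create_model_version(self, **kwargs):
        return {'modelId': kwargs['modelId'], 'modelVersionNumber': '2.00',
                'status': 'TRAINING_IN_PROGRESS'}

result = retrain_model(FakeClient(), 'testing_fraud_detection_model',
                       's3://yours3bucket/new_dataset.csv', 'role_arn')
print(result['modelVersionNumber'])  # 2.00
```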
Some key takeaways from my experiment with this service:

- Overall the service is quite easy to use, but there are things to watch out for when generating the dataset file:
  - Input dataset files need to be in CSV format
  - When collecting the data, it is important to capture the timestamp
  - The dataset needs to be labeled
  - Some data cleansing/transformation will generally be needed in order for the model to be trained
- It has a feature that allows you to use your own trained SageMaker model as the prediction model
- You will need some experience with AWS S3 and IAM in order to run this