This article is a step-by-step guide for running Emcien on a publicly available data set related to distributed denial of service (DDOS) attacks. The dataset contains 1.75 million transactions with 27 variables. You can download a CSV version of the data here. DDOS attacks cost businesses millions of dollars each year and affect many online services. This data contains records of network protocols, packet sizes, and packet sequences. This article walks through the steps to build a predictive model on the DDOS dataset and deploy it.
To guide ourselves through these steps, we will follow the Emcien Steps, which are used to solve most predictive problems.
This guide will use lab data provided by Mu’tah University and ResearchGate. Below is a glimpse of the distribution of packet types.
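The distribution can also be tallied directly from the CSV. The following is a minimal sketch, assuming the outcome column is named packet_type and the file is saved as ddos.csv; the sample rows, the pkt_size column, and the labels in the snippet are made up so that it runs without the download.

```python
import csv
import io
from collections import Counter


def packet_type_distribution(rows, column="packet_type"):
    """Return {label: (count, share of all rows)}, most common label first."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Tiny in-memory stand-in for the real file; columns and labels are made up.
sample = io.StringIO(
    "pkt_size,packet_type\n"
    "64,normal\n"
    "1500,attack_a\n"
    "64,normal\n"
    "512,normal\n"
)
distribution = packet_type_distribution(csv.DictReader(sample))
for label, (count, share) in distribution.items():
    print(f"{label:10s} {count:5d} {share:6.1%}")

# With the downloaded file, the same function can be used like this:
# with open("ddos.csv", newline="") as f:
#     distribution = packet_type_distribution(csv.DictReader(f))
```

Because the function only needs an iterable of row dictionaries, it reads the full 1.75 million rows as a stream instead of loading the whole file into memory.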
Our first decision point is what we are trying to predict. In this case, we are trying to predict the packet_type. The data is not a time series: incoming traffic can come from any source IP at any time and at any rate, so the problem is not structured around time. We can conclude this is a simple prediction/classification problem and move on to the next step.
Our first two steps are to bin the numeric data into predictive ranges and extract a hold-out set for testing. These two steps are easily done within Emcien using EmcienBandit.
From the Home Page, click the “Bandit” link at the top right corner.
Drag the file ddos.csv onto Bandit.
Select your Outcome Category and create your hold-out set.
At this point Emcien will process the source data, find the predictive ranges for all numeric data, and create a random hold-out set. Once Emcien is done, click the big green “Analyze” button.
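To make the idea of binning numeric data into ranges concrete, here is a small illustrative sketch. It uses plain equal-frequency binning, which is a generic stand-in and not a description of how Emcien finds its predictive ranges; the function names and the packet sizes are invented for the example.

```python
def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 cut points that split values into similarly sized groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def to_range_label(value, edges):
    """Map a number to a range label such as '[128, 1024)'."""
    lower = "-inf"
    for edge in edges:
        if value < edge:
            return f"[{lower}, {edge})"
        lower = edge
    return f"[{lower}, inf)"


# Made-up packet sizes, used only to exercise the functions.
packet_sizes = [40, 64, 64, 128, 512, 576, 1024, 1500, 1500]
edges = equal_frequency_edges(packet_sizes, n_bins=3)
labels = [to_range_label(size, edges) for size in packet_sizes]

for size, label in zip(packet_sizes, labels):
    print(size, label)
```

After this transformation every numeric value becomes a categorical token, which is the form a rule-based model can work with. Heavily repeated values can produce duplicate cut points in this simple version.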
Now we have two data sets. The first dataset will be used to build the model and extract predictive rules. The second dataset will be used to validate the model and test the quality of the results. The second file is only used when testing the model. When in production, new data will be processed by the system without any need for a hold-out file.
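For readers who want to see what a random hold-out split looks like outside the UI, here is a minimal sketch. The 20% fraction and the fixed seed are assumptions made for the example, not Bandit's settings.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=42):
    """Randomly split rows into (training_rows, test_rows)."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * holdout_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder rows stand in for the real transactions.
rows = [{"id": i} for i in range(10)]
train_rows, test_rows = split_holdout(rows, holdout_fraction=0.2)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

The important property is that no row appears in both sets, so the test rows are unseen when the rules are built.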
Click the big green “Analyze” button to build predictive rules. Note that there are now two files in the data list. The second file has the word “test” in its name; it will be used later.
Emcien will now analyze the file by transferring the file into a working stage, parsing the file by placing each token on the graph, computing the predictive rules, and loading the results into a database. Once the rules are ready, click the big green “View Rules” button.
Step 5: Reviewing the Rules (Optional)
Emcien’s predictive rules do not require human review, auditing, or validation. Emcien creates hundreds to thousands of rules for the Emcien predictive engine to use. The complete set of rules is the predictive model. These rules are used in concert, meaning no one rule determines an outcome.
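To illustrate what "used in concert" can mean, the sketch below combines several matching rules by summing their weights per outcome. This is a generic weighted-vote scheme chosen for illustration only; the rule format, weights, and field names are hypothetical and do not describe Emcien's engine.

```python
from collections import defaultdict

# Hypothetical rules: (conditions that must all match, predicted outcome, weight).
RULES = [
    ({"protocol": "icmp"}, "attack", 0.6),
    ({"protocol": "icmp", "size_range": "[-inf, 128)"}, "normal", 0.9),
    ({"size_range": "[-inf, 128)"}, "normal", 0.3),
]


def predict(transaction, rules=RULES):
    """Sum the weights of every matching rule and return (top outcome, scores)."""
    scores = defaultdict(float)
    for conditions, outcome, weight in rules:
        if all(transaction.get(key) == value for key, value in conditions.items()):
            scores[outcome] += weight
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)


small_icmp = {"protocol": "icmp", "size_range": "[-inf, 128)"}
large_icmp = {"protocol": "icmp", "size_range": "[1024, inf)"}
print(predict(small_icmp))
print(predict(large_icmp))
```

In the first transaction the single "attack" rule is outvoted by two "normal" rules, which shows how no one rule determines the outcome on its own.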
Exploring the rules is easy on the Rules page. Simply sort or select by desired outcome. To view a Rule in detail, click the little speech bubble on the right side of any Rule. This is called the “Tell Me”.
Step 6: Making Predictions
Now that we have Rules, it is time to use those rules on the hold-out set to determine the quality of our predictions. To make predictions, click the “Predict & View Outcome →” link at the bottom of the right column.
Next, make sure you select the “test” file that we created with Bandit. Click the big green “Predict” button.
This is what is called “Batched” prediction mode. There are three methods for making predictions:
- Batched Predictions - Using the UI or the RESTful API, this method makes up to a million predictions at a time. It is useful when processing large sets of data where latency is not critical.
- Real-time Predictions - Using the UI or the RESTful API, this method makes one prediction at a time with millisecond latency. This approach is useful when you have an inline process that requires low-latency results.
- Edge Predictions - This method is headless, meaning no UI or API is available. Edge Predictions employ a UNIX binary, written in C, that reads input on standard in and returns results on standard out.
Once batched predictions are complete, click the “View Predictions” button.
Step 7: Reviewing Prediction Accuracy & Capture Rate
With Batched Predictions, we receive a confusion matrix that allows us to assess the quality of the predictive rules. This matrix gives us a sense of how well our predictions matched the hold-out data, and whether these rules should be deployed to the data center.
A rule of thumb is that you want the diagonal of the matrix to be as solidly green as possible. Green represents “True Positives”, which means we correctly predicted the outcome.
At a quick glance, we see that our accuracy and capture rates are at 90% or better for every outcome except Smurf. Smurf appears to be a very difficult attack type to predict. Overall, these results are excellent, and the rules are ready to be deployed into the data center without further augmenting the input data.
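The per-outcome numbers can be derived from any confusion matrix. The sketch below assumes that "accuracy" for an outcome means precision (of the rows predicted as that outcome, how many were right) and that "capture rate" means recall (of the rows that truly had that outcome, how many were caught). Both readings are assumptions, and the two-label matrix is made up rather than taken from the results above.

```python
def per_outcome_metrics(matrix, labels):
    """Compute (precision, capture_rate) per label.

    matrix[i][j] is the number of rows whose actual label is labels[i]
    and whose predicted label is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]  # the diagonal cell
        predicted_as_label = sum(row[k] for row in matrix)  # column total
        actually_label = sum(matrix[k])  # row total
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        capture_rate = true_positives / actually_label if actually_label else 0.0
        metrics[label] = (precision, capture_rate)
    return metrics


# Made-up counts: rows are actual labels, columns are predicted labels.
labels = ["normal", "attack"]
matrix = [
    [90, 10],
    [5, 45],
]
metrics = per_outcome_metrics(matrix, labels)
for label, (precision, capture_rate) in metrics.items():
    print(f"{label:8s} precision={precision:.3f} capture_rate={capture_rate:.3f}")
```

A strong diagonal means both numbers are high for every label; large off-diagonal counts in a row lower that outcome's capture rate, and large off-diagonal counts in a column lower its precision.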
With Emcien’s three different methods for making predictions, there are several options for deploying the prediction engine to production. We will discuss two popular options.
Real-time RESTful API
When your architecture relies on web services or microservices, the Emcien RESTful API is very powerful. This API endpoint allows you to send a JSON-serialized version of a transaction or event to the Emcien Prediction Server. The prediction is made in milliseconds, and the HTTP response includes a JSON-serialized payload with the prediction, outcome, and reasons why the prediction was made.
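As a rough sketch of what such a call could look like from a client, the snippet below builds a JSON POST request with Python's standard library. The URL, the path, and the transaction fields are placeholders invented for this example; consult your installation's API documentation for the real endpoint, authentication, and payload format.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL of your own prediction server.
PREDICT_URL = "https://emcien.example.com/predict"


def build_prediction_request(transaction, url=PREDICT_URL):
    """Serialize one transaction to JSON and wrap it in an HTTP POST request."""
    body = json.dumps(transaction).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up transaction fields, for illustration only.
transaction = {"protocol": "icmp", "pkt_size": 64}
request = build_prediction_request(transaction)
print(request.get_method(), request.full_url)

# Sending the request and decoding the JSON response would look like this:
# with urllib.request.urlopen(request, timeout=5) as response:
#     result = json.load(response)
```

Building the request separately from sending it keeps the serialization easy to test without a running server.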
Predictions at the Edge means you can embed the Emcien engine into network gateways, routers, or boundary devices to watch incoming traffic, predict whether each packet is an attack type, and route attack traffic to a black hole or sinkhole. Emcien provides binaries for x86 and ARM devices.
Edge predictions allow for thousands of predictions per second and reduce latency to milliseconds. For more detailed information, please read this Emcien knowledge base article.
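The standard-in/standard-out pattern can be driven from a wrapper script. The sketch below shows only the plumbing: the name, flags, and input format of the real edge binary are not covered in this article, so a small stand-in child process that echoes each line with a dummy label is used in its place.

```python
import subprocess
import sys

# Stand-in for the real edge binary so the sketch runs anywhere: this child
# process echoes each input line back with a dummy "unknown" label.
STAND_IN_COMMAND = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip() + ' -> unknown')",
]


def predict_via_stdio(lines, command=STAND_IN_COMMAND):
    """Send one record per line on stdin and collect one result per line from stdout."""
    completed = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the child process exits with an error
    )
    return completed.stdout.splitlines()


# Made-up records; the real input format is an assumption.
results = predict_via_stdio(["tcp,64", "icmp,1500"])
for line in results:
    print(line)
```

To try this against an actual binary, pass its command line as the command argument. For sustained traffic, a long-lived process with streaming pipes would avoid paying process start-up cost per batch.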
This walkthrough used lab data to demonstrate the process and simplicity of using Emcien. We see that by taking only a few steps, and performing no feature engineering or model tweaking, Emcien was able to deliver a predictive model that predicted and captured DDOS attacks. Feel free to try these steps on your installation of Emcien or try one of our other sample data sets available here.