Using Machine Learning To Detect Anomalies
Credit to Author: dmitryc| Date: Mon, 21 Dec 2015 22:07:07 +0000
I’m going to start blogging more about detecting protocol/app anomalies, lateral movement, data exfiltration, and more. For many years I have been watching users and applications furrow their way across networks, and I’m gonna start data-dumping that info here 🙂
But…first…I manage a web server for a friend. It occurred to me that machine learning could be useful for alerting when an attack is under way. I took the following steps:
1) Get as much data as possible for this device. For Apache, this just meant gathering all the log files.
2) Parse the data and, for each session, look at the path taken as the user or bot perused the server (Note: outside of my initial scope, but timestamps are useful here to tell a user from a machine).
3) An average session will look like R1->R2->R3->RX, where each “R” is a request. So R1 could be index.html, R2 could be “Contact Us”, R3 could be “contact_form.php”, etc. I started out building a Markov model; instead, I took each consecutive pair of requests and recorded it as a transition…e.g. S={R1->R2,R2->R3,R3->RX}. For the next session I might have S={R1->R5,R5->R3,etc.}. At the end of all the parsing, I have a big set of all the observed state transitions for each R. So, given RX, there is a finite number of R states that RX can transition to.
4) For each of the R states, I now re-parse the log file and count the transitions. This gives a matrix of the observed transitions from RN to every other R state, expressed as the fraction of the time each transition occurs. So, for instance, let’s say that R1 goes to 3 possible states: R4 (27% of the time), R11 (3% of the time), and R12 (70% of the time). Then the R1 row of our matrix looks like [0, 0, 0, .27, 0, 0, 0, 0, 0, 0, .03, .7]
5) There were some special cases that I had to account for (any page transitioning to the main page, any page transitioning to itself, etc.). Once I accounted for these, I ran my program against the log files and created LOW, MEDIUM, and HIGH alerts. I didn’t use a true standard deviation and I ignored the LOW and MEDIUM stuff…I just wanted the hits where the number for that transition was extremely low or 0. From our example above, this would be a transition like R1->R2=0. I didn’t really expect great results and figured that I would have to do a lot more tweaking…well, this wasn’t the case. I actually got really, really good data on my first run. Example:
732 total state transitions tracked
HIGH RISK GET /componentes3.7/fckeditor/editor/fckeditor.html->GET /affiliate/affiliate53/fckeditor/editor/fckeditor.html
HIGH RISK GET /portfolio/aui/FCKeditor/editor/fckeditor.html->GET /componentes3.7/fckeditor/editor/fckeditor.html
HIGH RISK GET /wp-content/uploads/wpfouot.php->POST /wp-content/plugins/Login-wall-etgFB/login_wall.php
etc.
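For anyone who wants to try the steps above, here is a minimal Python sketch of the idea. It is not my original program. It assumes Apache common/combined log lines, treats each client IP as one session (a simplification), and uses a made-up threshold of 0.01 and a made-up home page of "GET /" for the special cases. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Start of an Apache common/combined log line: client IP, identity, user,
# [timestamp], then the quoted request line ("METHOD path protocol").
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+)[^"]*"')

def sessions_by_client(lines):
    """Group requests by client IP in log order (assumption: one IP = one session)."""
    sessions = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ip, method, path = m.groups()
            sessions[ip].append(f"{method} {path}")
    return dict(sessions)

def transition_counts(sessions):
    """Count every consecutive pair R(n) -> R(n+1) seen in any session."""
    counts = defaultdict(Counter)
    for requests in sessions.values():
        for src, dst in zip(requests, requests[1:]):
            counts[src][dst] += 1
    return counts

def transition_probs(counts):
    """Turn each row of counts into fractions that sum to 1."""
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs

def high_risk(requests, probs, threshold=0.01, home="GET /"):
    """Return (src, dst, p) for transitions that were rare or never seen.

    Self-transitions and jumps back to the home page are skipped, as a
    stand-in for the special cases. threshold and home are assumptions.
    """
    alerts = []
    for src, dst in zip(requests, requests[1:]):
        if src == dst or dst == home:
            continue
        p = probs.get(src, {}).get(dst, 0.0)
        if p <= threshold:
            alerts.append((src, dst, p))
    return alerts
```

Build the probabilities from logs you trust as a baseline, then pass each new session's request list to high_risk(). A sparse matrix stored as a dict of dicts is enough here, since most transitions are never observed.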
So, I can use really basic machine learning to find my attackers in my web logs. I then parse out the attackers’ IP addresses and can throw them into a firewall ruleset. In the future, I would like to automate this: detect when my server is under attack and send a message to my firewall, which drops in a route rule that spins all of the attackers’ traffic to my honeynet 🙂
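Here is a rough sketch of the "parse out the IPs and throw them into a ruleset" part. It assumes each alert is a hypothetical (client_ip, src_request, dst_request) tuple, and it assumes iptables/ip6tables as the firewall, which is just one option. It only builds the rule strings; it does not run anything.

```python
import ipaddress

def attacker_ips(alerts):
    """Collect the unique, valid client addresses from alert tuples.

    alerts: iterable of (client_ip, src_request, dst_request); this tuple
    shape is an assumption made for the sketch.
    """
    ips = set()
    for ip, _src, _dst in alerts:
        try:
            ips.add(ipaddress.ip_address(ip))
        except ValueError:
            continue  # not an IP address (e.g. a resolved hostname), skip it
    # IPv4 and IPv6 objects cannot be compared directly, so sort by version first.
    return sorted(ips, key=lambda a: (a.version, int(a)))

def drop_rules(ips):
    """Build one DROP rule per address, as text for review before applying."""
    rules = []
    for ip in ips:
        tool = "iptables" if ip.version == 4 else "ip6tables"
        rules.append(f"{tool} -A INPUT -s {ip} -j DROP")
    return rules
```

Validating with ipaddress before formatting the rule keeps anything odd that ended up in the log's client field from leaking into a firewall command.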
Speaking of honeypots, you can also honeypot certain pages. For instance, I could create bogus files or directories based on what I see attackers going after (like the report from above) and drop canary tokens in there (see Canary Tools). I can embed honeypot links within HTML comments and see which bots (or humans) are taking links from commented code and trying them out. I can put links in my robots.txt file and see who goes after them…there are so many ways to do this…and, at the end of the day, I can either run these attackers off my network or into a fake network…it’s just TONS and TONS of fun 🙂
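Checking who takes the bait can be as simple as scanning the access log for the decoy paths. A minimal sketch, assuming Apache common/combined log lines; the decoy paths below are made up for illustration and would be whatever you planted in robots.txt or in HTML comments.

```python
import re
from collections import defaultdict

# Client IP, identity, user, [timestamp], then "METHOD path protocol".
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([A-Z]+) (\S+)[^"]*"')

# Hypothetical decoys: paths no legitimate visitor has a reason to request.
DECOY_PATHS = {"/backup-old/", "/admin-test/login.php"}

def decoy_hits(lines, decoys=DECOY_PATHS):
    """Map each client IP to the set of decoy paths it requested."""
    hits = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, _method, path = m.groups()
        path = path.split("?", 1)[0]  # ignore any query string
        if path in decoys:
            hits[ip].add(path)
    return dict(hits)
```

Any client that shows up in the result either read robots.txt and ignored it or pulled links out of commented code, so one hit is already a strong signal.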
!Dmitry
dmitry.chan@gmail.com