Fun With Python: Log Analysis with Regex
EDIT: This post has been edited from its original. I found a flaw in my logic and have rewritten the code. -I. Tea. Security. 03/30/25
Introduction
My recent posts have been a bit heavy and wordy. This one is interactive, and I will try to be brief.
In my recent job searches I have found that scripting knowledge is listed as required for most postings related to analyst, engineer and ops roles, with Python specifically mentioned. So, I have been studying and practicing to build my skills with it.
Python is a powerful tool with very simple syntax. It has many uses, large and small. For an IT security pro it is especially helpful for automating routine tasks and for parsing large data sets.
I have been working on a utility to help with parsing and searching logs. I’d like to share a small portion of that today, in the hope that others might find some use in it.
I offer it to the reader under the terms of the GNU General Public License. The specific terms of the license can be found here: https://www.gnu.org/licenses/gpl-3.0.html
It is meant for educational purposes, is offered without warranty and the author is not responsible for any damages caused by its use.
This project has been especially enlightening for me because it has helped reinforce my knowledge of Python and regex. Regex is used to find patterns in text. It is a little difficult to understand at first, but once learned, can be a powerful tool for analysis. Combined with Python, it can be used to search large amounts of data for things like SSNs, CC#s, and other PII, which can be very useful in DLP and other data governance activities. In this case, I have leveraged the duo to search a firewall log for IPv4 addresses.
Preparation
These instructions are for Windows (sorry!) and I assume some basic level of computer knowledge.
The following is an example log entry that will be used to test the Python script. Copy and paste this into Notepad and save it as a .txt file named “log.txt” (no quotes).
date=2019-05-10 time=11:37:47 logid="0000000011" type="traffic" subtype="forward" level="notice" vd="vdom1" eventtime=1557513467369913239 srcip=10.12.13.45 srcport=5812 srcintf="port4" srcintfrole="undefined" dstip=23.59.154.35 dstport=80 dstintf="port11" dstintfrole="undefined" srcuuid="ae28f345-5252-38e9-f325-d1d2ce321f4b" dstuuid="ae28f494-5735-51e9-f247-d1d2ce663f4b" poluuid="ccb269e0-5735-51e9-a218-a397dd08b7eb" sessionid=105048 proto=6 action="close" policyid=1 policytype="policy" service="HTTP" dstcountry="United States" srccountry="Reserved" trandisp="snat" transip=172.6.34.67 transport=58012 appid=34050 app="HTTP.BROWSER_Firefox" appcat="Web.Client" apprisk="elevated" applist="g-default" duration=134 sentbyte=572 rcvdbyte=521 sentpkt=12 rcvdpkt=12 utmaction="allow" countapp=1 osname="Ubuntu" mastersrcmac="a5:e4:00:ec:25:56" srcmac="a5:e4:00:ec:25:56" srcserver=0 utmref=52625-267
The Code
The following is very simple code that finds IPv4 addresses in a .txt file and prints the results to the screen. Feel free to use it according to the GNU GPL.
Copy and paste this code into Notepad:
#import the Regex libraries
import re
#open the log.txt file and store it in a variable
log_file = open("log.txt")
#read the log.txt file, line by line, and store it in a list variable
log_strings = log_file.readlines()
#read the list log_strings, find IPv4 addresses, and print them to screen
def find_ipv4():
    ipv4_list = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", str(log_strings))
    print(ipv4_list)
#call the function find_ipv4()
find_ipv4()
Save the file in the same directory as the log.txt file. Name it “analyzer.py” (no quotes), and make sure to choose All Files in the dropdown for Save as type, or it will not save as .py.
Python
If you don’t already have it, go get it!
https://www.python.org/downloads/
Run the installer and follow the prompts. I will wait…
The Script in Action
After Python is installed, it’s time to run the script.
Open a command prompt and navigate to the folder that contains log.txt and analyzer.py.
Type the following command:
python analyzer.py
The script should print the results to the screen:
['10.12.13.45', '23.59.154.35', '172.6.34.67']
Cool! Of course, I also have some functions to add context to these results, and to manipulate them, but I’m not here to give away the farm.
Challenge: How can this be tweaked for IPv6 addresses?
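If you want a nudge on the challenge, here is one possible direction, offered as a rough sketch rather than the answer. A full IPv6 regex gets long quickly, so this version swaps the technique: it splits the text into tokens and lets Python's standard ipaddress module decide which ones are valid IPv6 addresses. The function name find_ipv6 and the sample line are made up for illustration; the original log entry contains no IPv6 addresses.

```python
import ipaddress
import re


def find_ipv6(text):
    """Return the tokens in text that parse as IPv6 addresses."""
    # Split on whitespace, "=" and double quotes so key=value fields come apart.
    tokens = re.split(r'[\s="]+', text)
    found = []
    for token in tokens:
        try:
            address = ipaddress.ip_address(token)
        except ValueError:
            continue  # not an IP address at all
        if address.version == 6:
            found.append(token)
    return found


# Hypothetical sample line, not taken from the log above.
sample = 'time=11:37:47 srcip=2001:db8::1 dstip=10.0.0.1 srcmac="a5:e4:00:ec:25:56"'
print(find_ipv6(sample))  # ['2001:db8::1']
```

The timestamp and the MAC address both contain colons, but neither parses as an IPv6 address, so they are skipped.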
For the sake of brevity, I will not dissect the script fully. It is a very simple script that should be understandable with basic Python and regex knowledge. Check out W3Schools for some excellent training and references.
However, I will explain the function that does the heavy lifting, and the regex search string used, as it may be confusing for anyone new to using regex with Python.
First I will explain the most complicated component:
ipv4_list = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", str(log_strings))
This line uses a function provided by the regex library “re” called “findall()”, which will search a string for whatever it is told to find. It then returns a list of all matches.
I have stored the returned list of matches in the variable “ipv4_list”.
The function takes two arguments. The first is the pattern to find. This can be literal text or, as in this case, a regex representing a string. The second is the string to be searched. Here, I pass it a string version of the list that contains all lines from the log file.
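To see the two arguments of findall() in isolation, here is a tiny sketch using a made-up string instead of the log file. The text and variable names are only for illustration.

```python
import re

# Hypothetical one-line "log", just for illustration.
text = "user=alice id=42 retries=3"

# First argument: the pattern (one or more digits). Second: the string to search.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['42', '3']

# When nothing matches, findall() returns an empty list rather than None.
nothing = re.findall(r"\d+", "no digits here")
print(nothing)  # []
```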
Finally, I’ll explain the regex string itself.
The letter “r” tells Python that the following string should be treated as a raw string. In a raw string, backslashes are left exactly as typed instead of being interpreted as escape characters, so sequences like \d and \. reach the regex engine intact.
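A quick way to convince yourself of what the “r” prefix does is to compare a raw string with a normal one. This is just a small sketch; the variable names are my own.

```python
import re

# In a normal string, "\n" is one character (a newline);
# in a raw string, r"\n" is two characters: a backslash and the letter n.
print(len("\n"))   # 1
print(len(r"\n"))  # 2

# Without the r prefix, each backslash meant for the regex engine must be doubled.
raw_pattern = r"\d{1,3}\.\d{1,3}"
escaped_pattern = "\\d{1,3}\\.\\d{1,3}"
print(raw_pattern == escaped_pattern)  # True

matches = re.findall(raw_pattern, "10.12 and 7.8")
print(matches)  # ['10.12', '7.8']
```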
Looking at the regex string and thinking logically about the output of the script, a pattern should be recognizable: an IPv4 address expressed as regex.
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} = x.x.x.x = 0.0.0.0 - 255.255.255.255
The pattern is four copies of the same regex expression, each representing one octet, separated by a “.”:
\d{1,3}\. = a single octet of an IPv4 address, followed by a “.”
Breaking it down further, \d on its own matches a single digit between 0 and 9. That alone is not enough for this use case.
So, I have to tell it that what I really want is a run of 1 to 3 digits, as each octet of an IPv4 address can be 1 to 3 characters long (0-255). Note that this only checks the length, not the value, so an out-of-range number such as 999 will also match.
To do so, I use {1,3}. Together with \d, this means I want to find a number that is between 1 and 3 digits long.
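Here is a small sketch of the difference between \d and \d{1,3}, run against a made-up string of numbers. Notice that a four-digit number gets split, because the pattern takes at most three digits at a time.

```python
import re

# Hypothetical test string, only for illustration.
text = "7 42 255 999 1234"

# \d alone matches one digit at a time.
single_digits = re.findall(r"\d", text)
print(single_digits)
# ['7', '4', '2', '2', '5', '5', '9', '9', '9', '1', '2', '3', '4']

# \d{1,3} matches runs of one to three digits, taking as many as it can.
runs = re.findall(r"\d{1,3}", text)
print(runs)  # ['7', '42', '255', '999', '123', '4']
```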
Each of these regex expressions is then separated by a full stop. However, in regex an unescaped “.” is a wildcard that matches any character, so to match a literal full stop it must be preceded by a “\”.
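A short sketch of what the backslash changes: unescaped, a dot in a regex matches any character (other than a newline), while the escaped version matches only a literal full stop. The test string is made up.

```python
import re

# Hypothetical test string, only for illustration.
text = "1.2 3x4"

# Unescaped dot: matches any character between the two digits.
unescaped = re.findall(r"\d.\d", text)
print(unescaped)  # ['1.2', '3x4']

# Escaped dot: matches only a literal full stop.
escaped = re.findall(r"\d\.\d", text)
print(escaped)  # ['1.2']
```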
So, putting it back together I get a regex expression representing an IPv4 address:
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
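One caveat worth knowing: this pattern checks the shape of an address, not whether each octet is 255 or less. If that matters, the matches can be filtered afterwards. The following is a minimal sketch of one way to do it; the helper name is_valid_ipv4 and the sample text (including the bogus address) are my own inventions for illustration.

```python
import re

IPV4_PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"


def is_valid_ipv4(candidate):
    """Return True if every octet of a dotted-quad string is at most 255."""
    return all(int(octet) <= 255 for octet in candidate.split("."))


# Hypothetical text with one out-of-range "address" mixed in.
text = "srcip=10.12.13.45 bogus=300.1.2.3 dstip=23.59.154.35"

candidates = re.findall(IPV4_PATTERN, text)
print(candidates)  # ['10.12.13.45', '300.1.2.3', '23.59.154.35']

valid = [ip for ip in candidates if is_valid_ipv4(ip)]
print(valid)  # ['10.12.13.45', '23.59.154.35']
```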
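I then surround the pattern with parentheses to treat the whole expression as a single group. With one group, findall() returns the text matched by that group, which here is the full address. Strictly speaking, the parentheses are optional in this case, since the group covers the entire pattern: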
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
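To show what the parentheses do to the output of findall(), here is a small sketch comparing no group, one group around everything, and one group per octet. The sample text is a fragment of the log entry; the variable names are my own.

```python
import re

text = "srcip=10.12.13.45"

# No group: findall() returns each whole match.
no_group = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", text)
print(no_group)  # ['10.12.13.45']

# One group around the whole pattern: same result.
one_group = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", text)
print(one_group)  # ['10.12.13.45']

# One group per octet: findall() returns a tuple of the groups for each match.
four_groups = re.findall(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", text)
print(four_groups)  # [('10', '12', '13', '45')]
```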
Then I surround the pattern with quotes because it is technically a string. re and its functions treat it as a regex pattern rather than as literal text to match:
"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
Don’t forget to tell Python that the string is raw by adding “r” at the beginning:
r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
Then, I pass it to the function, along with the string I want to search, and store the result in a list:
ipv4_list = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", str(log_strings))
Then, finally, just call it:
find_ipv4()
There are many ways this task could be approached. So far, I have found this to be the cleanest, most accurate approach, with the least amount of code, but I will keep tweaking it. It is also modular enough that it can be added to for improved functionality. It can read any text file, and anything can be done with the results.
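As an example of that modularity, here is one way the same idea could be reshaped so the function takes a file path and returns the list instead of printing it. This is a sketch under my own assumptions, not the version from my utility: the names IPV4_RE and demo_result are invented here, and the demo writes a shortened copy of the sample log entry to a temporary file so the block runs on its own.

```python
import os
import re
import tempfile

# Compile the pattern once so it can be reused for every line.
IPV4_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def find_ipv4(path):
    """Return every IPv4-looking string in the file at path, in order."""
    matches = []
    # The with statement closes the file automatically when the block ends.
    with open(path, encoding="utf-8") as log_file:
        for line in log_file:
            matches.extend(IPV4_RE.findall(line))
    return matches


# Demo on a throwaway file holding a shortened copy of the sample log entry.
sample = 'srcip=10.12.13.45 dstip=23.59.154.35 trandisp="snat" transip=172.6.34.67\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

demo_result = find_ipv4(tmp.name)
os.remove(tmp.name)
print(demo_result)  # ['10.12.13.45', '23.59.154.35', '172.6.34.67']
```

Returning the list rather than printing it means the caller decides what happens next, whether that is printing, counting, or writing the results to another file.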
Phew! So, not as brief as I promised. Time for a tea break.
Daily Cuppa
Today’s cuppa is Mandarin Mint Mindfulness provided by Yogi. Organic and ethically sourced.
It has a mellow fruity, minty flavor with a hint of rootiness.
It warms the belly and the soul, and is perfect for calming the mind and body after some hefty code and log analysis.