[MySQL] Using MySQL to analyze web usage from Windows ISA logs
This tutorial will teach you how to import a tab delimited file into a MySQL database. I built a program to do this for Windows ISA Server logs containing over 500,000 lines, which was far too many for Microsoft Excel to open and too much for Notepad to handle properly. These are the MySQL features we will be using:
- Load Data Local Infile
- Select
- Delete
- Drop
- Optimize Table
First I will show you how to put a tab delimited log from Windows ISA Server into the MySQL database. With the logs I have been working with, a day's worth of information has been between 500,000 and 700,000 lines, which is around 150-200MB in size alone.
You will find that if you use the default PHP timeout settings, large operations such as importing a file of this size into MySQL will time out.
I suggest that you open a command prompt and run through the process below to see how long the import takes. If you have previous logs, choose the biggest one so you get a worst-case scenario.
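One thing to check first: depending on your MySQL version and configuration, loading local files may be disabled by default, in which case the import below fails with a "command is not allowed" error. A minimal sketch of enabling it on the server, assuming you have the privileges to change global settings (you may also need to start the mysql command line client with the --local-infile=1 flag):
SET GLOBAL local_infile = 1;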
Before we can import the data into a MySQL database we need a database and table to import it into. I am not going to go into much depth on this as it is quite simple to do using phpMyAdmin or a similar program.
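If you would rather create the database from the command line, a minimal sketch (the name isa_logs is just an example, use whatever suits you):
CREATE DATABASE isa_logs;
USE isa_logs;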
For the ISA log table you can use the code I have here:
CREATE TABLE `tbl_log1` (
  `id` int(11) NOT NULL auto_increment,
  `ip` varchar(100) collate latin1_general_ci NOT NULL,
  `username` varchar(100) collate latin1_general_ci NOT NULL,
  `browser` varchar(100) collate latin1_general_ci NOT NULL,
  `date` varchar(100) collate latin1_general_ci NOT NULL,
  `time` varchar(100) collate latin1_general_ci NOT NULL,
  `isa` varchar(100) collate latin1_general_ci NOT NULL,
  `blank` varchar(100) collate latin1_general_ci NOT NULL,
  `proxy` varchar(100) collate latin1_general_ci NOT NULL,
  `proxyip` varchar(100) collate latin1_general_ci NOT NULL,
  `proxyport` varchar(100) collate latin1_general_ci NOT NULL,
  `num1` varchar(100) collate latin1_general_ci NOT NULL,
  `num2` varchar(100) collate latin1_general_ci NOT NULL,
  `num3` varchar(100) collate latin1_general_ci NOT NULL,
  `protocol` varchar(100) collate latin1_general_ci NOT NULL,
  `method` varchar(100) collate latin1_general_ci NOT NULL,
  `url` varchar(255) collate latin1_general_ci NOT NULL,
  `type` varchar(100) collate latin1_general_ci NOT NULL,
  `num4` varchar(100) collate latin1_general_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci
Feel free to change the character set and collation if you wish.
Now for the code. To import the data we will use the Load Data Local Infile feature of MySQL. Given the right parameters, this loads a file from your machine into the MySQL database.
load data local infile 'C:\\logfile.log'
into table tbl_log1
fields terminated by '\t'
lines terminated by '\n'
(ip,username,browser,date,time,isa,blank,proxy,proxyip,proxyport,num1,num2,num3,protocol,method,url,type,num4)
The last line is a list of the columns in the order they appear in the log. The only columns we really want are username, date, time and url, although you may want ip as well if you are monitoring network activity on static IP addresses so you can see where most of the bandwidth is being used. If you are importing a different file then change this list to the names of the columns in your delimited file. The names within the brackets are the names you gave the columns in the table.
For example if you have a list of usernames, passwords and emails you will want 3 columns and a last line looking like this:
(username,password,email)
The column names have to be in the same order as the file. If email is the first column in the file then make it the first name within the brackets.
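Putting that together, a minimal sketch for that hypothetical file (the path C:\\users.txt and table tbl_users are just example names; only the path, table and column list differ from the ISA statement above):
load data local infile 'C:\\users.txt'
into table tbl_users
fields terminated by '\t'
lines terminated by '\n'
(username,password,email)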
The 1st line should be the same for everyone except for the location of the file. Note that with the Local keyword the file is read from the machine you run the statement on and sent to the server; without it, MySQL expects to find the file on the server itself. Also note the 2 backslashes: the first backslash is there to escape the second.
The 2nd line is the table we are importing into; make sure you have set up a table for your data to go in.
The 3rd line sets the field delimiter. \t is the escape sequence for a tab, \r is a carriage return and \n is a newline; for a comma separated file you would simply use ',' here instead.
The 4th line sets the delimiter for the end of the line. In this case each line of the log is a new row, so this is \n for newline. Be aware that files produced on Windows usually end lines with \r\n, so if a stray carriage return turns up in the last column of every row, use '\r\n' here instead.
The 5th line is the list of columns that we are importing into.
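One more clause worth knowing about: if your log file starts with header lines (W3C style logs usually do), you can skip them with Ignore. A sketch assuming a single header line (adjust the number to match your file):
load data local infile 'C:\\logfile.log'
into table tbl_log1
fields terminated by '\t'
lines terminated by '\n'
ignore 1 lines
(ip,username,browser,date,time,isa,blank,proxy,proxyip,proxyport,num1,num2,num3,protocol,method,url,type,num4)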
The only problem is if you are importing a large log file like I was. What I had to do was cut out all the rubbish I didn't need to make queries against the database faster.
First of all I dropped all the columns I didn't need.
ALTER TABLE `tbl_log1`
DROP `ip`,
DROP `browser`,
DROP `isa`,
DROP `blank`,
DROP `proxy`,
DROP `proxyip`,
DROP `proxyport`,
DROP `num1`,
DROP `num2`,
DROP `num3`,
DROP `protocol`,
DROP `method`,
DROP `type`,
DROP `num4`
This took out all of the columns I didn't need and brought the size of the table down quite a bit.
The second thing was that the log I had contained a lot of anonymous entries, roughly a third of the rows. So I ran this little bit of code to remove all the anonymous results.
DELETE FROM tbl_log1 WHERE username like "%anonymous%"
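If you want to see how many rows a delete like this will hit before you run it, you can do the same match with a count first:
SELECT COUNT(*) FROM tbl_log1 WHERE username like "%anonymous%";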
The problem is that deleting this many rows leaves a lot of overhead in the table, because MyISAM does not reclaim the space straight away. So you can either optimize the table after each delete query or run it once at the end before you start searching.
OPTIMIZE TABLE tbl_log1
This will remove the overhead on the table and will decrease the query times.
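If you are curious how much overhead a table is carrying, you can check before and after optimizing; the Data_free column in the output shows the reclaimable space in bytes:
SHOW TABLE STATUS LIKE 'tbl_log1';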
You can now use queries to search for a user's internet usage or a certain website's visitors. You will find a lot of URLs in the list: if a user typed in www.microsoft.com it wouldn't just say www.microsoft.com in the log, it would have the URL of every image the user downloaded, every CSS file, every page... everything.
You may also want to set up other queries to remove certain websites that you know are safe and don't care about, such as www.microsoft.com:
DELETE FROM tbl_log1 WHERE url like "%.microsoft.%"
To search for anyone going on the BBC website you could type the following:
SELECT username FROM tbl_log1 WHERE url like "%bbc.co.uk%"
This will display a list of users that have been on the BBC website. But it will come up with multiple duplicates of each username due to the issue stated above. To stop this we can add the GROUP BY clause to our query to group the usernames together so we only get a single result per user.
SELECT username FROM tbl_log1 WHERE url like "%bbc.co.uk%" GROUP BY username
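The same grouping also makes it easy to see who visited a site the most. A small sketch that counts the matching rows per user (the hits alias is just a name I picked):
SELECT username, COUNT(*) AS hits FROM tbl_log1 WHERE url like "%bbc.co.uk%" GROUP BY username ORDER BY hits DESC;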
This tutorial was originally posted on Skeletorscorpse.com