Okay, I doubt this post will be useful to anyone else, but I thought I might as well post it since I haven’t updated this blog in a while.
On ‘my’ web server I have some sites that are served by a shared Apache process. Apache stores daily logs for one week, rotating the log numbering each day. I have a cron job that runs once a week and concatenates the past week’s logs into a single file, which is then stored elsewhere.
Today I was trying to run awstats to generate stats from the last month’s worth of logs (I check my sites’ stats once a month). However, I realised that my cron job had been concatenating the daily logs into the weekly log file in the wrong order. This resulted in awstats only picking up the first day from each weekly log, as the rest of the log file had requests with earlier timestamps. It seems that awstats works through the log file chronologically.
So, to get my stats I had to re-order the log files so that the entries were listed chronologically as they should be. To do this I decided to use PHP, mostly just because that is the language I am most familiar with.
The script simply loops through the log file, adding each line to an array, using the request timestamp as the key. Then the array can be sorted by the key, and output back to the file.
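As a toy illustration of that approach, the snippet below keys an array by each entry's timestamp (plus a counter so duplicate timestamps don't overwrite each other) and then sorts with ksort(). The two log lines here are made-up examples, not real entries:

```php
<?php
// Two out-of-order entries in Apache's timestamp format
$lines = array(
    '[28/Dec/2013:03:15:45 +0000] second request',
    '[28/Dec/2013:03:15:44 +0000] first request',
);

$records = array();
$i = 0;
foreach ($lines as $line) {
    // The timestamp is the 26 characters after the opening '['
    $date = new DateTime(substr($line, 1, 26));
    // Zero-pad the counter so the string sort stays stable for ties
    $records[$date->format(DateTime::ATOM) . sprintf('%03d', $i++)] = $line;
}

// Sort by key, i.e. chronologically
ksort($records);
echo implode("\n", $records), "\n";
```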
I ran the script using the PHP CLI, and found I had to set a large memory limit to avoid out of memory errors. The exact amount of memory needed will depend on how large your logs are.
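If you're unsure how much memory to allow, one option is to raise the limit from inside the script with ini_set() rather than the CLI's -d flag, and report the peak usage afterwards so the limit can be tuned for future runs. The 128M figure here is just an example, not a measured requirement:

```php
<?php
// Raise the memory limit from within the script (equivalent to -d memory_limit=128M)
ini_set('memory_limit', '128M');

// ... process the log files here ...

// Report the peak memory actually used, in MB, to help tune the limit
printf("Peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);
```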
```php
<?php
//Usage: ~/path/to/php/bin/php -d memory_limit=128M ./whatever-you-name-this-file.php

$sites = array('xoogu', 'xoogu-static1', 'another-domain', 'another-domain-static1', 'domain3', 'domain3-blog');
$dates = array('20131215', '20131222', '20131229', '20140105');

//loop through all the sites
foreach ($sites as $site) {
    //loop through all the dates
    foreach ($dates as $date) {
        //array to hold each line from the log file
        $records = array();
        //log file location
        $logFile = "/path/to/logs-archive/$site/$site-access-$date.log";
        //open the logfile for reading
        $handle = fopen($logFile, 'r');
        //initialise a counter used to ensure we don't lose log entries that occurred at the same time
        $i = 0;
        //read the log one line at a time
        while ($str = fgets($handle)) {
            //parse the date from the log entry
            $recordDate = new DateTime(substr($str, strpos($str, '[') + 1, 26));
            //store the log entry by the date, with the counter concatenated on the end to
            //avoid overwriting an existing record for a different request that occurred at
            //the same time (zero-padding keeps the string sort stable for those ties)
            $records[$recordDate->format(DateTime::ATOM) . sprintf('%09d', $i)] = $str;
            //increment the counter
            $i++;
        }
        //close the file
        fclose($handle);
        //sort our records to be in date order
        ksort($records);
        //rename the unsorted log file so we still have a copy if anything goes wrong
        rename($logFile, "$logFile-unsorted");
        //the original file has been renamed, so now create a new version of the file and open it
        $handle = fopen($logFile, 'w');
        //write the records in order to the file
        foreach ($records as $record) {
            fwrite($handle, $record);
        }
        //close the file
        fclose($handle);
        //output progress to screen
        echo "$site-access-$date.log sorted\n";
    }
}
```
A couple of points to make:
My weekly logs are stored in the format $site-access-$date.log, e.g. for this site the log would be xoogu-access-2014-01-07.log.
The format of the logs is like this:
208.115.113.87 - - [28/Dec/2013:03:15:45 +0000] "GET /robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; help@moz.com)"
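As a quick check that the timestamp extraction works on a line in this format, here is the relevant part of the script run against the sample line above. The 26 characters after the opening '[' are the timestamp, in Apache's Common Log Format, which PHP's DateTime parses natively:

```php
<?php
// The sample log line shown above
$str = '208.115.113.87 - - [28/Dec/2013:03:15:45 +0000] "GET /robots.txt HTTP/1.0" 200 125 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; help@moz.com)"';

// Pull out the 26-character timestamp after the opening '['
$timestamp = substr($str, strpos($str, '[') + 1, 26);
// "28/Dec/2013:03:15:45 +0000" is a format DateTime understands directly
$recordDate = new DateTime($timestamp);

echo $recordDate->format(DateTime::ATOM), "\n"; // 2013-12-28T03:15:45+00:00
```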
So if your log uses a different format (possible), or you use a different structure for storing your logs (very likely), then you’d need to modify the script appropriately. However, I’d be pretty surprised if anyone other than me would have a need for this script. (If you do, leave a comment below.) And I shouldn’t even need it any more now that I’ve corrected my weekly log archiving cron job.