Thursday, February 5, 2015

Python - Text Crawler

I find Python extremely useful in many scenarios. Some of the work I do involves a lot of manual effort and takes a lot of time, so I started experimenting with Python now and then to automate a few of the things I work with.

One case where Python is very useful for me is parsing large text files. We often need to parse a huge file to pull out data, or parse a log file to create reports. In all these cases Python comes in very handy. I will share a small snippet below to show how powerful Python is with its regular expression module.





We need to identify the pattern we want to track down and create a regular expression for it. We can write a template-style snippet in which only the regular expression needs to change, so the code can be reused.


# import statements
import os
import sys
import re


def main(argv):
    """Hold the base logic and validation.

    Parse the given file and pick out the specified pattern.
    """
    if len(argv) == 1:
        path = argv[0]
        if os.path.isfile(path):
            # 'with' closes the file even if reading fails
            with open(path, 'r') as fh:
                data = fh.read()
            # capture the text between "(105)" and " -Permission denied"
            matches = re.findall(r'\[error\] \(105\)(.*?) -Permission denied', data, re.DOTALL)
            # print() with a single argument works on both Python 2 and 3
            print(matches)
        else:
            print('provide a valid file')
    else:
        usage()


def usage():
    """Print the usage of the script when the input is not provided as expected."""
    print('Usage: python crawler.py path')


# boilerplate template - it invokes main function
if __name__ == "__main__":
    main(sys.argv[1:])
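
One way to push the template idea a little further is to pass the regular expression in as an argument instead of editing it inside main. The sketch below is only an illustration: the helper name find_matches and the second pattern are my own assumptions, not part of the script above.

```python
import re


# Hypothetical helper, not part of the original script: the pattern is an
# argument, so the same function can be reused for different searches.
def find_matches(text, pattern, flags=0):
    """Return every capture-group match of `pattern` found in `text`."""
    return re.findall(pattern, text, flags)


# Two lines in the same format as the sample log used in this post.
sample = (
    "[Sun Mar 7 17:21:44 2004] [info] (104)Connection reset by peer: "
    "client stopped connection before send body completed\n"
    "[Sun Mar 7 17:23:53 2004] [error] (105)sample.txt -Permission denied\n"
)

# Same pattern as in the script above.
print(find_matches(sample, r'\[error\] \(105\)(.*?) -Permission denied'))
# A different (assumed) pattern: the numeric code that follows an [info] tag.
print(find_matches(sample, r'\[info\] \((\d+)\)'))
```

With this shape, switching to a new search only means changing the pattern string at the call site.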


Below is the sample file content I used as input:

[Sun Mar 7 16:05:49 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 16:45:56 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:13:50 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:21:44 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:23:53 2004] [error] (105)sample.txt -Permission denied
[Sun Mar 7 17:27:37 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:31:39 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:58:00 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:00:09 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:10:09 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:19:01 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:42:29 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:52:30 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:58:52 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:03:58 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:08:55 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:22:11 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:31:25 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:23:53 2004] [error] (105)template.txt -Permission denied
[Sun Mar 7 18:42:29 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:52:30 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:58:52 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:03:58 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:08:55 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:22:11 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 19:31:25 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 17:23:53 2004] [error] (105)example.txt -Permission denied
[Sun Mar 7 18:00:09 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:10:09 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:19:01 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:42:29 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 18:52:30 2004] [info] (104)Connection reset by peer: client stopped connection before send body completed

Below is the output I got:

['sample.txt', 'template.txt', 'example.txt']
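
The script reads the whole file into memory with fh.read(), which is fine for a sample like this but may become a problem for really huge logs. A possible alternative, sketched here under the assumption that every match sits on a single line (as it does in the sample), is to scan the file line by line with a compiled pattern. The names PERMISSION_DENIED, iter_matches and demo_lines are hypothetical and used only for this illustration.

```python
import re

# Compiled once; no re.DOTALL is needed because each match is assumed
# to sit on a single line.
PERMISSION_DENIED = re.compile(r'\[error\] \(105\)(.*?) -Permission denied')


def iter_matches(lines, pattern=PERMISSION_DENIED):
    """Yield the first capture group for every line that matches `pattern`."""
    for line in lines:
        match = pattern.search(line)
        if match:
            yield match.group(1)


# A file object is iterable line by line, so real usage could look like:
#     with open(path) as fh:
#         for name in iter_matches(fh):
#             print(name)
# A small list of lines stands in for the file here.
demo_lines = [
    "[Sun Mar 7 17:21:44 2004] [info] (104)Connection reset by peer: "
    "client stopped connection before send body completed\n",
    "[Sun Mar 7 17:23:53 2004] [error] (105)sample.txt -Permission denied\n",
    "[Sun Mar 7 17:23:53 2004] [error] (105)template.txt -Permission denied\n",
]
print(list(iter_matches(demo_lines)))
```

Because iter_matches is a generator, only one line is held in memory at a time. The trade-off is that a pattern spanning several lines would no longer match.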


Links for further reading on Python regular expressions:

https://developers.google.com/edu/python/regular-expressions

https://docs.python.org/2/library/re.html

Thanks for reading

Cheers!

  
