4 February 2016

Python crawler controller


I work on a project for which I have written 20+ crawlers. They run 24/7 (with a good amount of sleep). Whenever I need to update or restart the server, I have to start all the crawlers again, so I wrote a script to control them. For each crawler, it first checks whether the crawler is already running; if not, it starts the crawler in the background. It also saves the pid of each crawler it starts in a text file, so that I can kill a particular crawler immediately when needed.

Here is my code:

import shlex
from subprocess import Popen, PIPE

# Display name -> [crawler script, log file name]
site_dt = {'Site1 Name': ['site1_crawler.py', 'site1_crawler.out'],
           'Site2 Name': ['site2_crawler.py', 'site2_crawler.out']}

location = "/home/crawler/"

# Mode 'w' truncates the file, so it only holds the pids of crawlers started in this run.
pidfp = open('pid.txt', 'w')


def is_running(pname):
    # Equivalent of the shell pipeline: ps ax | grep pname
    p1 = Popen(["ps", "ax"], stdout=PIPE)
    p2 = Popen(["grep", pname], stdin=p1.stdout, stdout=PIPE)
    p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
    output = p2.communicate()[0]
    # Match on the full path so that the grep process itself is not counted.
    return (location + pname) in output


def main():
    for item, (script, logname) in site_dt.items():
        print item
        if is_running(script):
            print script, "already running"
            continue
        cmd = "python " + location + script + " -l info"
        outfile = "log/" + logname
        fp = open(outfile, 'w')

        # Popen returns immediately, so the crawler keeps running in the background.
        pid = Popen(shlex.split(cmd), stdout=fp).pid

        print pid
        print
        # pid is an int, so convert it before concatenating.
        pidfp.write(item + ": " + str(pid) + "\n")

    pidfp.close()
    
    
if __name__ == "__main__":
    main()
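
The saved pids are meant for killing a particular crawler when needed. Here is a minimal sketch of how that could look, assuming pid.txt holds one "Site Name: pid" line per crawler. The helper names read_pids, is_alive and kill_crawler are my own illustrations and are not part of the script above, and unlike that script this sketch is written for Python 3.

```python
import os
import signal


def read_pids(path="pid.txt"):
    """Parse lines of the form 'Site Name: 1234' into a {name: pid} dict."""
    pids = {}
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue
            # Split on the last colon so that site names may contain colons.
            name, _, pid = line.rpartition(":")
            pids[name.strip()] = int(pid)
    return pids


def is_alive(pid):
    """Return True if a process with this pid exists."""
    try:
        os.kill(pid, 0)  # Signal 0 only performs the existence check.
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def kill_crawler(name, path="pid.txt"):
    """Send SIGTERM to the named crawler; return True if a signal was sent."""
    pid = read_pids(path).get(name)
    if pid is None or not is_alive(pid):
        return False
    os.kill(pid, signal.SIGTERM)
    return True
```

For example, kill_crawler("Site1 Name") would stop the first crawler. One caveat: after a server restart the pids in the file are stale and may have been reused by unrelated processes, so it is probably safer to combine this with a check on the process name before sending the signal.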

If you feel that there is scope for improvement, please comment.
