Access Django models inside of Scrapy

Is it possible to access my Django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model?

I've seen this, but I don't really understand how to set it up.

Pyretic answered 24/11, 2010 at 22:9 Comment(5)
https://mcmap.net/q/112738/-use-django-orm-as-standalone-duplicate – Paucity
possible duplicate of Use only some parts of Django? – Paucity
That's not really what I am looking for, because I am already using Django. I don't want to just use the ORM, and I also don't want to have to maintain two separate settings files. – Pyretic
You want to use one part of Django: the ORM. That's a common question. Please search. The Django site referenced in that question has the specific ways to use the ORM separately without extra settings. Please actually read the question, the answers, and follow the links. It's been answered. – Paucity
Sorry S. Lott, this is not the same question. – Corroboration

If anyone else is having the same problem, this is how I solved it.

I added this to my scrapy settings.py file:

def setup_django_env(path):
    import imp
    from django.core.management import setup_environ

    # Locate and load the Django settings module found at the given path
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    # Configure the Django environment around those settings
    setup_environ(project)

setup_django_env('/path/to/django/project/')

Note: the path above points to your Django project folder, not to the settings.py file itself.

Now you will have full access to your Django models inside your Scrapy project.
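
For example, a pipeline can then save scraped items straight to a model through the ORM. Here is a minimal sketch, not from the original answer; myapp and MyModel are hypothetical names, and it assumes the item's keys match the model's field names:

# pipelines.py -- a rough sketch with hypothetical names
from myapp.models import MyModel

class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        # Create and save a model instance from the item's fields
        MyModel.objects.create(**dict(item))
        return item

Remember to enable the pipeline via ITEM_PIPELINES in your Scrapy settings.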

Pyretic answered 29/11, 2010 at 16:10 Comment(6)
Here's a related answer that includes pipeline.py code: #7883696 – Lawson
Just a small note: with the new project layout in Django 1.4, the path should be setup_django_env('/path/to/django/project/project/') – Overjoy
This solution was working great for me until I tried to deploy using scrapyd. When scrapyd automatically builds the egg, it seems to be missing the Django code in the package. I get this: Deploying my_scraper-1354463004 to localhost:6800/addversion.json Server response (200): {"status": "error", "message": "ImportError: No module named settings"} -- any advice on how to handle this? It would seem that if I could get the Django code into the egg I'd be okay, but I'm not really clear on how to do that. – Liver
I'm trying to use your solution but it's giving me an ImportError. Any chance you'd be willing to look at my question (#14686723) and offer advice? – Sprite
Can you tell me how to solve this error? I followed the steps you mentioned above: sudo scrapy deploy default -p eScraper Building egg of eScraper-1370604165 'build/scripts-2.7' does not exist -- can't clean it zip_safe flag not set; analyzing archive contents... Deploying eScraper-1370604165 to http://localhost:6800/addversion.json Server response (200): {"status": "error", "message": "ImportError: Error loading object 'eScraper.pipelines.EscraperPipeline': No module named eScraperInterfaceApp.models"} – Renter
In current Django versions, from django.core.management import setup_environ no longer works. How do I do this now? – Snowbird

The opposite solution (set up Scrapy in a Django management command):

# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py

from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        # Stash the raw argv so Django's option parsing is bypassed
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        # Hand everything after the program name straight to Scrapy
        from scrapy.cmdline import execute
        execute(self._argv[1:])

and in Django's settings.py:

import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_project.settings'

Then, instead of scrapy foo, run ./manage.py scrapy foo.

UPD: fixed the code to bypass Django's option parsing.

Defection answered 8/2, 2012 at 16:41 Comment(14)
@Mikhail: I'm actually trying your code snippet with Django 1.4 and Scrapy 0.14.3. Unfortunately, it does not work. For instance, if I want to execute python manage.py scrapy list inside the Django project folder, I always get ImportError: No module named cmdline. However, the module named cmdline does exist, and the site-packages directory of my Python installation is in the PYTHONPATH as well. What am I doing wrong? Thanks in advance! – Infirmity
Do you have scrapy in your PYTHONPATH? – Defection
Okay, I solved the problem. First I thought that the line from __future__ import absolute_import wouldn't be necessary in Python 2.7, so I commented it out, but it only works with this line. Generally, I have some problems with understanding absolute and relative imports in Python; I should definitely read into this a bit more. Anyway, thanks for your help! – Infirmity
@Mikhail: I just realized that I cannot pass Scrapy's command line options such as -o scraped_data.json -t json. I know how to add options to commands in general, but how do I link them to Scrapy's counterparts? – Infirmity
@Peter: please try the updated example. It should pass options through to Scrapy instead of trying to handle them as Django's options. – Defection
@Mikhail: Awesome! I never thought that this would be so easy. I don't know why it works, but it works. Thank you so much! Meanwhile, I have found another solution in my own thread, but yours is definitely the way to go. :-) – Infirmity
@Mikhail: This is working great for me from the command line, thanks, but I can't run the management command from inside Django. If I try >>> from django.core import management >>> management.call_command('command_name') I get AttributeError: 'Command' object has no attribute '_argv' -- any suggestions? – Liver
I think you can instantiate the command and run it using the run_from_argv method: myapp.management.commands.scrapy.Command().run_from_argv(['', 'crawl', 'dmoz']) – Defection
Thanks -- that was basically it. I got this to work: myapp.management.commands.scrapy.Command().run_from_argv(['scrapy', '', 'crawl', 'dmoz']) – Liver
After going down this very deep rabbit hole, I ultimately abandoned this approach for a number of reasons: 1) the inability to restart Twisted means that only the first command works, so triggering crawls from multiple user-initiated actions is impossible; 2) I had to rewrite a whole bunch of Scrapy to get around the fact that Twisted assumes it is started from the main thread. So I recommend using scrapyd instead and calling it as a web service. – Liver
@MikhailKorobov: I'm currently dealing with deploying my spiders to a scrapyd server. However, when I execute python manage.py scrapy server, I get scrapy.exceptions.NotConfigured: Unable to find scrapy.cfg file to infer project data dir. How do I resolve this? – Infirmity
@MikhailKorobov: You can find a more detailed explanation of my problem in this thread. – Infirmity
You don't need to bypass option parsing; you just need a POSIX-style delimiter. See my answer to Peter Stahl's question. – Qp
@MikhailKorobov: How do you set up your Scrapy project directory inside the Django directory layout? Thanks! – Expellee

Set the DJANGO_SETTINGS_MODULE environment variable in your Scrapy project's settings.py:

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'your_django_project.settings'

Now you can use DjangoItem in your Scrapy project.

Edit:
You have to make sure that your_django_project's settings.py is on your PYTHONPATH.
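
For example, an item class can be bound to a model like this. This is a sketch with hypothetical names (Person, PersonItem); note that depending on your Scrapy version, DjangoItem lives in scrapy.contrib.djangoitem (older releases) or in the separate scrapy-djangoitem package:

# items.py -- a rough sketch with hypothetical names
from scrapy.contrib.djangoitem import DjangoItem  # or: from scrapy_djangoitem import DjangoItem
from myapp.models import Person

class PersonItem(DjangoItem):
    # Item fields are generated automatically from the model's fields
    django_model = Person

Calling item.save() on such an item in a pipeline then persists the scraped data through the ORM.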

Leishaleishmania answered 25/11, 2010 at 5:54 Comment(0)

For Django 1.4, the project layout has changed. Instead of /myproject/settings.py, the settings module is in /myproject/myproject/settings.py.

I also added path's parent directory (/myproject) to sys.path to make it work correctly.

def setup_django_env(path):
    import imp, os, sys
    from django.core.management import setup_environ

    # Locate and load the Django settings module found at the given path
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

    # Add path's parent directory to sys.path
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_env('/path/to/django/myproject/myproject/')
Overjoy answered 27/7, 2012 at 2:59 Comment(1)
Note that the usage of setup_environ is deprecated starting from version 1.4. – Infirmity

Check out django-dynamic-scraper; it integrates a Scrapy spider manager into a Django site.

https://github.com/holgerd77/django-dynamic-scraper

Lujan answered 11/1, 2013 at 13:14 Comment(0)

Why not create an __init__.py file in the Scrapy project folder and hook it up in INSTALLED_APPS? It worked for me. I was able to simply use:

pipelines.py

from my_app.models import MyModel
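
For example, with a hypothetical layout where the Scrapy project folder my_scraper sits inside the Django project next to my_app:

# Django settings.py -- a rough sketch; 'my_scraper' is the hypothetical
# Scrapy project folder made importable by the __init__.py mentioned above
INSTALLED_APPS = (
    # ... the usual Django apps ...
    'my_app',
    'my_scraper',
)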

Hope that helps.

Quadrivium answered 8/4, 2015 at 18:1 Comment(0)

setup_environ is deprecated. You may need to do the following in Scrapy's settings file for newer versions of Django (1.7+, where django.setup() is available):

def setup_django_env():
    import sys, os, django

    # Make the Django project importable and point Django at its settings
    sys.path.append('/path/to/django/myapp')
    os.environ['DJANGO_SETTINGS_MODULE'] = 'myapp.settings'

    # Populate Django's app registry (available since Django 1.7)
    django.setup()

setup_django_env()
Mcculloch answered 15/8, 2016 at 13:58 Comment(0)

Minor update to solve a KeyError. Python 3 / Django 1.10 / Scrapy 1.2.0:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Scrapy commands. Accessible from: "Django manage.py". '

    def __init__(self, stdout=None, stderr=None, no_color=False):
        super().__init__(stdout=stdout, stderr=stderr, no_color=no_color)

        # Optional attribute declaration.
        self.no_color = no_color
        self.stderr = stderr
        self.stdout = stdout

        # Actual declaration of CLI command
        self._argv = None

    def run_from_argv(self, argv):
        self._argv = argv
        # Passing these options explicitly avoids the KeyError raised
        # by BaseCommand.execute() in newer Django versions
        self.execute(stdout=None, stderr=None, no_color=False)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

The SCRAPY_SETTINGS_MODULE declaration is still required.

os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')
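
As in the earlier management command answer, spiders are then run through manage.py, e.g. ./manage.py scrapy crawl myspider (the spider name here is hypothetical).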
Ursa answered 12/10, 2016 at 21:57 Comment(0)
