Portia is an excellent open source visual web scraper. However, installing Portia is a daunting task for people who are not used to Linux. The team at Scrapinghub has tried to make installation simple, but dependency problems often make it complicated.

Here is what worked for me. I did it on an Ubuntu machine, but it should also work on any other Debian-based distro where apt is available.

Part 1 : Installing Portia

Step 1. Install the dependencies 

Run the following commands to get all the dependencies. Missing dependencies are what cause most of the trouble later.

sudo apt-get install build-essential python-dev
sudo apt-get install python-pip
sudo apt-get install python-scrapy
sudo apt-get install git
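
Once the installs finish, a quick check like the one below can confirm that the tools ended up on your PATH. This is only a sketch: the function name missing_commands is made up for this example, and the command names gcc, pip, scrapy and git are my assumption about what the packages above provide on your system.

```shell
# Print the name of every given command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed command names for the packages installed in this step.
missing_commands gcc pip scrapy git
```

If this prints a name, that tool is not available yet and the matching apt-get command is worth running again.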

Step 2. Install Portia

Download Portia from here, extract it, and run the following command from the slyd directory.

sudo pip install -r ./requirements.txt

It will read requirements.txt and will probably tell you that many requirements are already satisfied (since you installed some of them in step 1). It takes some time to finish and will probably give you some warnings.

Step 3.  Run Slyd

If everything went well, you should be able to start the Twisted web server with slyd. Go to the slyd directory and run

twistd -n slyd

If you get errors, some requirements are still missing. If not, you will see that an instance of the Twisted server has started.

Part 2 : Annotating a Web Page

1. Run the slyd server. Go to the slyd directory and run

twistd -n slyd

2. Point your browser to

http://localhost:9001/static/main.html

This should show you the main Slyd page.

3. Enter the starting page here.

4. Annotate the page. Start clicking and tagging the fields.

5. Save the project. By default, projects are saved as newproject1, newproject2, etc., and inside each one there is a spider named after the website.

Part 3 : Run your Spider and dump data to csv

Now that you have tagged a page, you can ask your spider to crawl, collect data, and dump it to a file of your choice.

Go to the slyd/data/projects directory. Here you can see all your projects. Run

portiacrawl <project name> <spider name> -t csv -o <outputfile.csv>

Your spider starts crawling and dumps the data to outputfile.csv in CSV format.
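
To see whether the crawl actually collected anything, you can take a quick look at the output file. Here is a minimal sketch: csv_summary is a helper name I made up, not part of Portia, and it assumes the first line of the file is the header row.

```shell
# Show the header line and the number of data rows in a CSV file.
# Returns 1 if the file is missing or empty.
csv_summary() {
    file=$1
    [ -s "$file" ] || { echo "no data in $file" >&2; return 1; }
    echo "columns: $(head -n 1 "$file")"
    # Every line after the header is counted as one data row.
    echo "rows: $(( $(wc -l < "$file") - 1 ))"
}
```

You would call it as csv_summary outputfile.csv. The row count is only approximate if any scraped field contains a line break, because the function counts lines and does not parse the CSV.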

12 Comments

  1. Kyle

    Thank you for posting this tutorial. I have been searching the web and I cannot seem to find an answer on the PROPER syntax to run ‘portiacrawl’ from the CMD prompt.

    Could you please tell me the syntax you would use to initiate ‘portiacrawl’ given the following project directory and spider name:
    Project Directory location :C:\Python27\slyd\data\projects\Learning\spiders
    Spider name: Test

    To initialize, would it be: python portiacrawl Learning Test -t csv -o test.csv

    • Akash Jain

      I see you are running it on Windows. Unfortunately I don’t know the command on Windows. However, at the Linux prompt I simply use portiacrawl without python. Your syntax is absolutely correct otherwise.

      I will be posting a virtual machine of Ubuntu with Portia installed on it soon. You can download VirtualBox and run portiacrawl within the VM so there are no installation hassles.

  2. jim

    Thanks for this great tutorial.
    How can I run “twistd slyd” at system startup on Ubuntu? I want the website to keep working after a reboot.

    • Akash Jain

      You can add the command to your /etc/rc.local file using gedit, just before exit 0. To do this, run ‘gedit /etc/rc.local’ at the command prompt.

      Or you can add the command to your startup using Ubuntu startup programs.
      The command would be

      cd /path_to_slyd && twistd -n slyd

      The first part changes to the directory where slyd is, and the second part runs the twistd server in that directory.

      • jim

        thanks. it works!

      • jim

        Do you know how to start portia as a different user on ubuntu server startup?
        I tried the following without success:
        cd /path_to_slyd && su myUser -c twistd -n slyd

        Thank you!

      • BBB

        I need detailed procedure for how to extract data using portia alone..

  3. Jan

    Many thanks Akash Jains works 100% Tutorial is spot on for Portia very nice !!!
    Best tutorial 2014 for Portia !!!

  4. MrT

    do you know how to add the URL to the data extracted? I made my spider with portia, but I don’t know how to add the URL where the data came from.

    thks

    • Akash Jain

      sorry, but couldn’t find that setting either. I will post it here if I find out.


  5. /static/main.html
    You can find the URL of the extracted page in the head tags.
    Select a tag at random and click on the parameter icon.
    Click on html, then head, and finally base. There you have the URL.

    You can extract all meta tags

  6. BBB

    Plz send me the portia tutorial..

