Scraper Engine

From Moving Pictures

Script Basics

The root node of every scraper script is 'ScriptableScraper'; there are no exceptions. Every scraper script also needs a 'details' node, which is a child of 'ScriptableScraper' and describes the scraper to Moving-Pictures. This information does more than tell Moving-Pictures how to handle your script: it also tells would-be users what the script is for, which language it targets, and so on. Please read more about the details node here (link to details node page).

Apart from the details node, every scraper script also needs at least one action node. The name attribute on an action node tells the scraper engine what the action is for.

  • Movie Details scraper requires the following action nodes
    • <action name="search">
    • <action name="get_details">
  • Cover Art scraper requires the following action node
    • <action name="get_cover_art">
  • Backdrop scraper (needs information)

The search node builds the list of possible matches. The get_details node is responsible for setting the movie details.

Before continuing, note that a script's version number and release date must both be changed before anyone can use it in place of a previous version of the script. The only exception is debug mode, which can be enabled or disabled from the Data Sources window (don't forget to also set the MediaPortal log verbosity to debug). Debug mode is intended for script authors: it adds enhanced logging while you build or edit scripts, and it lets Moving-Pictures update a script whose release date and version number have not changed. Always change the version number and release date before posting a script. If this is not done, users will not be able to use your script.


Base example of a movie details scraper:

<ScriptableScraper>

<details></details>
<action name="search"></action>
<action name="get_details"></action>

</ScriptableScraper>
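
By analogy, a cover art scraper would presumably use the same skeleton with the single action node named in the list above. This is only a minimal sketch: the details node is left empty here, and its real contents are described on the details node page.

```xml
<ScriptableScraper>

  <!-- Scraper metadata; contents omitted here, see the details node page -->
  <details></details>

  <!-- The only action node the list above names for a cover art scraper -->
  <action name="get_cover_art"></action>

</ScriptableScraper>
```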


Variables

Variables can either be assigned via the 'Set' node, or they can be returned from a node as is the case with the retrieve and parse nodes.


When you assign a variable with the set node, you can use that variable later in your script. This is also how you set properties of your movie within a script. Note that setting a variable and reading one use different syntax. Below is an example that sets a variable named 'varName' to the value 'varValue'. To use the variable later in your script, wrap its name in ${}; for the set node below, that is ${varName}.

<set name="varName" value="varValue" />


Let's pretend the variable holds something useful, like the title of a movie. We could then assign its value to movie.title, as in the example below:

<set name="movieTitle" value="What About Bob" />

<set name="movie.title" value="${movieTitle}" />
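
A variable reference can also appear in the middle of a longer value, as the url attribute of the retrieve node shown later on this page does. The sketch below assumes the value attribute of a set node is expanded the same way; the names movieYear and movieLabel are made up for illustration.

```xml
<set name="movieTitle" value="What About Bob" />
<set name="movieYear" value="1991" />

<!-- Literal text and variable references combined in one value
     (assumed to expand like the url attribute of a retrieve node) -->
<set name="movieLabel" value="${movieTitle} (${movieYear})" />
```

If that assumption holds, ${movieLabel} would expand to the title followed by the year in parentheses.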


Usually your data source will make web requests and parse the returned pages for the information you are looking for. This involves two new node types:

  • Retrieve - This node does a web request and assigns the response to a variable.
  • Parse - This node takes a variable (a web page, for example) and parses it with a regular expression. The results of the regular expression are assigned to a new variable. Below is an example of this:

<set name="rx_search_results">
      <![CDATA[
      ><a href="/title/(tt\d{7})/"[^>]+>(?!")([^<]+)</a> \((\d{4})[\/IVX]*\)(?! \(VG\))(<br>.*?aka\s(<em>.+?</em>).*?</td>)?
      ]]>
</set>


<retrieve name="search_page" url="http://akas.imdb.com/find?s=tt;q=${search.title:safe}" />

<parse name="details_page_block" input="${search_page}" regex="${rx_search_results}"/>

Do not worry if you did not understand all of this; much of what is going on here has not been explained yet. The first chunk is a set node (simple enough) that defines the regular expression used later by the parse node. Wrapping the expression in a CDATA section lets you write it as-is. If you do not set your regular expressions this way, you will need to escape the characters that XML would otherwise error out on (use &gt; for >, &lt; for <, etc.). That would be perfectly valid, but not very easy to read. The retrieve node does a web request to the value of its url attribute and assigns the page to the variable named by its name attribute. The parse node then takes the contents of the page via its input attribute and tests it against its regex attribute. The results of the regular expression are assigned to the variable named by the parse node's name attribute. In this case the results are available as ${details_page_block}.
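
For comparison, here is a sketch of what an inline, escaped regular expression might look like. The pattern is a deliberately simplified, hypothetical one (it is not the rx_search_results expression), and the variable name link_block is made up for illustration.

```xml
<!-- Unescaped, the pattern reads: <a href="([^"]+)">([^<]+)</a>
     Inside an attribute, the angle brackets and double quotes must be
     written as XML entities -->
<parse name="link_block" input="${search_page}" regex="&lt;a href=&quot;([^&quot;]+)&quot;&gt;([^&lt;]+)&lt;/a&gt;" />
```

Even for a pattern this short, the escaped form is noticeably harder to read, which is why the CDATA approach is preferable.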


Often a regular expression will turn up more than one result, especially when you are searching for movies. That is why the parse node returns an array of results. Let's say that the regular expression in the example returns 4 items per result: Title, Year, IMDb ID, and details page, in that order. To get the first result's year you would use ${details_page_block[0][1]}. The parse node returns a two-dimensional array where the first dimension is the result index and the second dimension is the index of the items returned by the regex. Use capture groups (parenthesized sub-expressions) in your regular expression to return values to the parse node array.
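
As a rough sketch, the first result could be copied into movie properties like this. The indexes follow the hypothetical item order described above (title, year, IMDb ID, details page), not necessarily the capture order of the rx_search_results pattern, so adjust them to match your own expression.

```xml
<!-- First dimension: result index (0 = first match)
     Second dimension: item index within that match, in the assumed order
     title, year, IMDb ID, details page -->
<set name="movie.title" value="${details_page_block[0][0]}" />
<set name="movie.year" value="${details_page_block[0][1]}" />
<set name="movie.imdb_id" value="${details_page_block[0][2]}" />
<set name="movie.details_url" value="${details_page_block[0][3]}" />
```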

Variables Passed to Scraper

The action nodes have several predefined variables that you can work with. These variables are as follows:

  • ${search}
    • This variable is a collection of information that Moving-Pictures gathered from the movie file. It contains the following items, each of which is passed only if available:
      • ${search.title}
      • ${search.year}
      • ${search.imdb_id}
      • ${search.disc_id}
      • ${search.moviehash} (1.0+)
      • ${search.basepath} (1.0+)
      • ${search.foldername} (1.0+)
      • ${search.filename} (1.0+)
      • ${search.filename_noext} (1.0+)
      • ${search.clean_filename} (1.0+)
  • ${settings}
      • ${settings.defaultuseragent} (1.3+)
      • ${settings.mepo_data} (1.3+): Path to the MediaPortal data directory.
  • ${movie}
    • This variable is the collection of information scraped about your movie. The overall goal of your scraper is to add information to this item, but the 'search' action node already fills in some basic information about your movie. Your other action nodes use what the 'search' action node found to carry out their scraping tasks. (This will be covered in the action node page.) It contains the following items:
      • ${movie.title}
      • ${movie.site_id}
      • ${movie.alternate_titles}
      • ${movie.year}
      • ${movie.directors}
      • ${movie.writers}
      • ${movie.actors}
      • ${movie.genres}
      • ${movie.studios} (1.3+)
      • ${movie.certification}
      • ${movie.language}
      • ${movie.tagline}
      • ${movie.summary}
      • ${movie.score}
      • ${movie.popularity}
      • ${movie.runtime}
      • ${movie.movie_xml_id}
      • ${movie.imdb_id}
      • ${movie.details_url}
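
To illustrate how the ${search} items might be used, here is a sketch of a retrieve node that builds a search URL from the title and year. The host www.example.com and its query parameters are placeholders, not a real data source, and ${search.year} may be empty when no year was found.

```xml
<!-- www.example.com and the title/year parameters are placeholders.
     The ampersand between parameters is written as an XML entity because
     it sits inside an attribute value -->
<retrieve name="search_page" url="http://www.example.com/find?title=${search.title:safe}&amp;year=${search.year}" />
```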

Variable Modifiers

Variable modifiers are methods you can call on a variable to change its value. A modifier is called by putting a colon ':' after the variable name, followed by the modifier name. The following table lists the available modifiers.

Modifier     Description
safe         Makes the value of the variable URL safe. An optional character encoding can be given in parentheses, for example safe(ISO-8859-1).
htmldecode   Decodes any HTML special characters in the value of the variable (&amp; for example).
striptags    Removes any HTML tags found in the value of the variable.

An example of a variable calling a modifier would be the following:

${testVar:safe}
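
The sketch below shows each modifier in a plausible context. It assumes that modifiers can be applied to indexed parse results as well as to plain variables; the host www.example.com and the variable summary_block are hypothetical.

```xml
<!-- URL-encode the search title using an explicit character encoding -->
<retrieve name="search_page" url="http://www.example.com/find?q=${search.title:safe(ISO-8859-1)}" />

<!-- Decode HTML entities in a captured title before storing it -->
<set name="movie.title" value="${details_page_block[0][0]:htmldecode}" />

<!-- Remove markup from a captured summary; summary_block is a
     hypothetical parse result -->
<set name="movie.summary" value="${summary_block[0][0]:striptags}" />
```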

Closing

That is about it for the basics on the scraper engine. There are nodes to add some logic to your scraper for looping and conditionals. Please review the nodes page for further information.