Scraping the Internet

Warning: this technique has been partially deprecated. The tutorial hasn’t been updated yet. If you need help in this area, please contact me.

 

Getting information from the internet into a chatbot’s output can be very useful. It lets the bot show continuously changing values, like weather information, and it can potentially also be used for learning, although the latter is a little trickier.

Retrieving, or scraping, info from the internet can be done remarkably easily with the chatbot designer. Here’s a screencast of a bot that retrieves weather information from the Google weather API.

In the video, a .NET plug-in is used to retrieve information from the internet by means of XPaths. This plug-in is included by default in the application. Note, though, that plug-ins are only supported in the pro version. Basic users will be able to use these projects, but they can’t create or edit any patterns that rely on plug-ins. Also, plug-ins are loaded on a project-by-project basis, so if you want to use the scraping features in your own project, you will first need to make certain that the correct .NET functions have been loaded. Once this has been set up, though, all plug-ins will automatically be loaded when the project is opened.

Loading

To load a plug-in, go to view/communication channels/OS. This will bring up a view like the one on the right. From here, you can load and unload dlls, classes and functions.

First up is the dll. This can be loaded with one of the buttons on the toolbar. The first one gives access to the cache (dlls that have already been loaded). With the next button, you can select a file from disk. Note that, even though the ‘CmdShell.dll’ file (which contains the scraping functions) is part of the installation, it isn’t guaranteed to be in the cache already, so you might have to select it from the ‘program files/Chatbot designer pro/’ path. By the way, you can remove a dll by selecting it and pressing delete.

Functions can be selected or deselected with the checkbox in front of the name. Alternatively, you can (de)select the entire class or lib at once. Notice the blue label behind each function name: this is the name that you can use in the patterns. The do-patterns evaluator has no knowledge whatsoever of namespaces, classes or functions; it just knows a single name. This means that all function names have to be unique within a single project. If you try to enter a duplicate name, a red box will be displayed around the newly mapped name.
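The single-name rule can be pictured as one flat lookup table. The following Python sketch is only an illustration of that idea, not how the chatbot designer is implemented: ‘FunctionMap’, ‘map_name’ and ‘call’ are hypothetical names, and raising an error stands in for the red box that the designer shows.

```python
class FunctionMap:
    """Hypothetical flat name table: one pattern-visible name per loaded function."""

    def __init__(self):
        self._by_name = {}

    def map_name(self, name, function):
        # No namespaces or classes here: the bare name is the only key,
        # so a second mapping with the same name has to be rejected.
        if name in self._by_name:
            raise ValueError(f"duplicate function name: {name!r}")
        self._by_name[name] = function

    def call(self, name, *args):
        return self._by_name[name](*args)


functions = FunctionMap()
functions.map_name("ScrapeText", lambda source, xpath: [])

try:
    functions.map_name("ScrapeText", lambda source, xpath: [])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)
```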

There are quite a few functions available for scraping. Basically though, there are 3 groups: some functions to open/close web-pages, some functions to get data from those opened pages and finally the same functions that don’t require you to first open/close any files but which can do a scrape directly.

Short scrapes

Which technique to use depends on how much data you need to retrieve. If there is only 1 xpath that you have to run on a page, it is probably best to use the short/direct functions, which don’t require you to open the web-page first. Instead, the address is supplied as an argument, together with the xpath. Here’s a list of the available quick scrapers:

Name          Arg 1             Arg 2  Result
ScrapeText    file or web path  XPath  0, 1 or more text values
ScrapeInt     file or web path  XPath  0, 1 or more int values
ScrapeDouble  file or web path  XPath  0, 1 or more floating point values
ScrapeDate    file or web path  XPath  0, 1 or more dates

Here is a short usage example that gets the temperature info from the Google API for a city that’s defined in ‘$place’:

$value = ScrapeText("http://www.google.com/ig/api?weather=$place:interleaf(+)", "/xml_api_reply/weather/current_conditions/temp_c/@data")

As you can see, the first argument specifies the web-page to open. The second is an xpath to the data attribute of the ‘temp_c’ element. Note that we use ‘:interleaf(+)’ because the Google API expects city names that contain multiple words to be separated with a ‘+’, like: New+York.
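To make the behaviour of a quick scraper a bit more concrete, here is a rough Python sketch of what such a call does conceptually: load the xml, follow the path and return 0, 1 or more text values. This is not the plug-in’s code. The helpers ‘scrape_text’ and ‘interleaf’ are hypothetical, the sample xml is made up to match the path from the example, and because Python’s standard ElementTree only understands a subset of XPath, the trailing ‘/@data’ step is handled by hand.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, shaped like the path used in the example.
SAMPLE_XML = """<xml_api_reply>
  <weather>
    <current_conditions>
      <temp_c data="21"/>
    </current_conditions>
  </weather>
</xml_api_reply>"""


def interleaf(text, separator="+"):
    """Join the words of `text` with `separator` (stand-in for ':interleaf(+)')."""
    return separator.join(text.split())


def scrape_text(xml_source, xpath):
    """Return a list of 0, 1 or more text values for a simple absolute path.

    Only plain element steps and an optional trailing '/@attribute' are handled.
    """
    root = ET.fromstring(xml_source)
    steps = xpath.strip("/").split("/")
    attribute = None
    if steps[-1].startswith("@"):
        attribute = steps.pop()[1:]
    # The first step names the root element itself.
    if not steps or steps[0] != root.tag:
        return []
    relative = "/".join(steps[1:]) or "."
    results = []
    for element in root.findall(relative):
        if attribute is not None:
            if attribute in element.attrib:
                results.append(element.attrib[attribute])
        elif element.text is not None:
            results.append(element.text)
    return results


print(interleaf("New York"))
print(scrape_text(SAMPLE_XML, "/xml_api_reply/weather/current_conditions/temp_c/@data"))
```

A path that matches nothing simply gives an empty list, which corresponds to the ‘0 values’ case in the table of quick scrapers.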

More scraping

The second scraping method is primarily useful if you need to run multiple xpaths on the same content. In this case, it’s far more economical to first retrieve the page, run all the queries on the cached file and finally, when done, release it again. This can be accomplished with the remaining scrape functions.

You open a file or webpage with either ‘OpenScraper’ or ‘OpenScraperHTML’. The first works on xml content, the second on html. That is, the second will convert html to xml so that the xpath can be run on it. Both return an integer that needs to be used in subsequent calls. Basically, the integer replaces the filename as a reference. It allows you to have multiple files open and to have the system run multi-threaded and let it serve multiple people at the same time.

The scraping functions themselves are almost identical to the quick versions, except that they take an integer as first argument instead of a path. Other than that, usage is exactly the same, with the same types: one each for text, integers, doubles and dates.

Once you are done with the file, you have to call ‘CloseScraper’ with, as argument, the integer that was returned by ‘OpenScraper(HTML)’, so that resources can be cleaned up. This is important: if you forget to do this, the pages that stay open keep consuming resources and the system will eventually give up.

In a normal usage situation, you would do a short salvo: open a page, do a few scrapes and close it again, all in 1 block. This is not a requirement, though. You can keep the page open across multiple inputs, as long as you maintain a reference to the scraper (the integer) somewhere in memory so that you don’t lose track of it.
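The open/scrape/close cycle with an integer as reference can be sketched in Python as follows. Again, this is only an illustration under assumptions: ‘ScraperPool’ and its methods are hypothetical names, the sample xml is made up, and the paths are relative to the root element because that is what the standard ElementTree expects.

```python
import itertools
import threading
import xml.etree.ElementTree as ET


class ScraperPool:
    """Hypothetical registry that hands out integer handles for parsed documents."""

    def __init__(self):
        self._documents = {}
        self._next_handle = itertools.count(1)
        self._lock = threading.Lock()  # lets several conversations share the pool

    def open_scraper(self, xml_source):
        """Parse once and return an integer that stands in for the document."""
        root = ET.fromstring(xml_source)
        with self._lock:
            handle = next(self._next_handle)
            self._documents[handle] = root
        return handle

    def scrape_text(self, handle, path):
        """Run a path (relative to the root element) on an already opened document."""
        with self._lock:
            root = self._documents[handle]
        return [el.text for el in root.findall(path) if el.text is not None]

    def close_scraper(self, handle):
        """Release the document; skipping this keeps every opened page in memory."""
        with self._lock:
            self._documents.pop(handle, None)

    def open_count(self):
        with self._lock:
            return len(self._documents)


pool = ScraperPool()
handle = pool.open_scraper(
    "<forecast><city>Ghent</city><day>Mon</day><day>Tue</day></forecast>"
)
try:
    city = pool.scrape_text(handle, "city")
    days = pool.scrape_text(handle, "day")
finally:
    pool.close_scraper(handle)  # always release, even if a scrape fails

print(city, days, pool.open_count())
```

Wrapping the scrapes in try/finally is one way to make sure the close call is never skipped when the whole salvo happens in 1 block.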

Html scraping

As already mentioned, html scraping is done by first converting the page into xml before the xpath is executed. This conversion can cause some ‘changes’ in the structure of the file. In other words, a path that you work out from the html file might not be correct for the xml version. It is therefore best to build your xpaths against the xml version of the html pages.

The conversion routine that’s internally used by the chatbot designer is based on the SGMLReader library. This library provides a command-line tool to manually convert html to xml files, which can be very useful for building the correct query. I’ve included a direct download for the command-line html to xml conversion tool. Here’s a short description of how to use it (taken from the original documentation):

sgmlreader <options> [InputUri] [OutputFile]

-e "file"        Specifies a file to write error output to. The default is to generate no errors. The special name "$stderr" redirects errors to the stderr output stream.
-proxy "server"  Specifies the proxy server to use to fetch DTDs through the firewall.
-html            Specifies that the input is HTML.
-dtd "uri"       Specifies some other SGML DTD.
-base            Add an HTML base tag to the output.
-pretty          Pretty print the output.
-encoding name   Specify an encoding for the output file (default UTF-8).
-noxml           Stops generation of the XML declaration in the output.
-doctype         Copy the <!DOCTYPE tag to the output.
InputUri         The input file name or URL. Default is stdin. If this is a local file name then it also supports wildcards.
OutputFile       The optional output file name. Default is stdout. If the InputUri contains wildcards then this just specifies the output file extension, the default being ".xml".

Examples:

sgmlreader -html *.htm *.xml
Converts all .htm files to corresponding .xml files using the built in HTML DTD.

sgmlreader -html http://www.msn.com -proxy myproxy:80 msn.xml
Converts the MSN home page to XML, storing the result in the local file “msn.xml”.

sgmlreader -dtd ofx160.dtd test.ofx ofx.xml
Converts the given OFX file to XML using the SGML DTD “ofx160.dtd” specified in the test.ofx file.

Building an XPath

Once you have your xml file, getting the xpath to the element that you want can still be a little challenging. Html files simply aren’t designed with this type of usage in mind (and hey, if it can be made easier for xml files, why not). Enter Firebug, an add-on for Firefox that allows developers to get a closer look at the html, or xml. After you have installed Firebug and loaded the xml file into Firefox, go to tools/Web developer/Firebug/Open firebug so that you can see the debug panel. In this panel, select the element that you wish to query, open the context menu and select ‘copy XPath’. And that’s it: simply paste this path into the chatbot designer and you’re done.
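Before pasting a copied path into a pattern, it can help to try it on the converted xml file first. Here is a small Python sketch of such a check. It assumes that the copied path is an absolute path made of element names with optional positions, such as ‘/html/body/div[2]/p[1]’, and that the xml has no namespaces. The sample document and the helper ‘try_xpath’ are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for a page that has already been converted to xml.
CONVERTED_XML = (
    "<html><body>"
    "<div><p>menu</p></div>"
    "<div><p>12</p><p>sunny</p></div>"
    "</body></html>"
)


def try_xpath(xml_text, absolute_path):
    """Return the text of every element matched by a copied absolute path."""
    root = ET.fromstring(xml_text)
    steps = absolute_path.strip("/").split("/")
    # The first step names the root element, which ElementTree has already selected.
    if steps[0].split("[")[0] != root.tag:
        return []
    relative = "/".join(steps[1:]) or "."
    return [element.text for element in root.findall(relative)]


print(try_xpath(CONVERTED_XML, "/html/body/div[2]/p[1]"))
print(try_xpath(CONVERTED_XML, "/html/body/div[3]/p[1]"))
```

An empty result, as for the second path, is a hint that the path was built against a different structure than the converted xml.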
