I recently wrote a specialized crawler that was able to navigate an unknown page structure and look for “interesting” content using SlimerJS. The structure is unknown because we crawl different websites, none of them sharing common id or class attributes.
As a side note, I chose SlimerJS instead of PhantomJS because with the latter I wasn’t able to capture AJAX content.
When started, the crawler receives a few command-line arguments: the website URL, together with other information that helps guide it to the content we are interested in.
It was quite good to learn that you can use xargs, a command available on any UNIX system, to parallelize the execution of a given command. In its manual you can find the following:
-P maxprocs Parallel mode: run at most maxprocs invocations of utility at once.
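A quick way to see the flag in action (a toy example, not from the original script) is to fan out several long-running commands and watch the total wall-clock time drop:

```shell
# Eight one-second sleeps, at most four running at once:
# the pipeline finishes in roughly 2 seconds instead of 8.
yes 1 | head -n 8 | xargs -n 1 -P 4 sleep
```

With `-P 1` (or the flag omitted) the same pipeline would run the invocations one after another.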
To put it all together, I created a bash script similar to the following, which runs at most $parallelism_level instances of the crawler at any given moment:
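The original script isn’t reproduced here, so this is a minimal sketch of the pattern. The input file’s field layout and the crawler invocation are assumptions; `crawl_site` just echoes its arguments so the sketch runs without SlimerJS installed (the real function would start the crawler, e.g. `slimerjs crawler.js "$@"`):

```shell
#!/usr/bin/env bash
# Sketch: launch at most $parallelism_level crawler instances at once.

parallelism_level=4

crawl_site() {
  # "$@" holds the fields of one input line: the site URL followed by
  # any hints that guide the crawler to the interesting content.
  # In the real script this would be: slimerjs crawler.js "$@"
  echo "crawling $1 with hints: ${*:2}"
}
export -f crawl_site

# Demo input file: one site per line, fields separated by spaces.
input_file=$(mktemp)
printf '%s\n' \
  'http://example.com news article' \
  'http://example.org blog post' > "$input_file"

# -I {} hands xargs one whole line at a time; -P caps how many bash
# processes (and therefore crawlers) run concurrently. Because {} is
# not quoted inside the bash -c string, the line's fields word-split
# into separate arguments for crawl_site.
cat "$input_file" | xargs -I {} -P "$parallelism_level" bash -c 'crawl_site {}'

rm -f "$input_file"
```

Note that the unquoted substitution means lines containing shell metacharacters would be interpreted by `bash -c`, so this pattern assumes you control the input file’s contents.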
The script expects an input file where each line contains all the information required to run the crawler. The function
crawl_site is responsible for preparing and starting a new instance, passing it any necessary arguments.
After exporting the function, we cat the input file and pipe it to xargs with the necessary flags. xargs then launches a new bash process that calls our function with the current line’s content as its argument. Since the passed line is not quoted, space-separated fields in the line arrive as separate arguments to the receiving function.
As soon as one of the running crawlers exits, xargs automatically launches a new one, until all the lines of the file have been processed.