Headless internet browser? (webautomation)
Here is a list of headless browsers that I know about:
- HtmlUnit - Java. Custom browser engine. Limited JavaScript support/DOM emulated. Open source.
- Ghost - Python only. WebKit-based. Full JavaScript support. Open source.
- Twill - Python/command line. Custom browser engine. No JavaScript. Open source.
- PhantomJS - Command line/all platforms. WebKit-based. Full JavaScript support. Open source.
- Awesomium - C++/.NET/all platforms. Chromium-based. Full JavaScript support. Commercial/free.
- SimpleBrowser - .NET 4/C#. Custom browser engine. No JavaScript support. Open source.
- ZombieJS - Node.js. Custom browser engine. JavaScript support/emulated DOM. Open source. Based on jsdom.
- EnvJS - JavaScript via Java/Rhino. Custom browser engine. JavaScript support/emulated DOM. Open source.
- Watir-webdriver with headless gem - Ruby via WebDriver. Full JavaScript support via real browsers (Firefox/Chrome/Safari/IE).
- Spynner - Python only. PyQT and WebKit.
- jsdom - Node.js. Custom browser engine. Supports JS via emulated DOM. Open source.
- TrifleJS - port of PhantomJS using MSIE (Trident) and V8. Open source.
- ui4j - Pure Java 8 solution. A wrapper library around the JavaFx WebKit Engine incl. headless modes.
- Chromium Embedded Framework - Full up-to-date embedded version of Chromium with off-screen rendering as needed. C/C++, with .NET wrappers (and other languages). As it is Chromium, it has support for everything. BSD licensed.
- Selenium WebDriver - Full support for JavaScript via browsers (Firefox, IE, Chrome, Safari, Opera). Officially supported bindings are C#, Java, JavaScript, Haskell, Perl, Ruby, PHP, Python, Objective-C, and R. Unofficial bindings are available for Qt and Go. Open source. (A minimal Python sketch follows after this answer.)
Headless browsers that have JavaScript support via an emulated DOM generally have issues with some sites that use more advanced/obscure browser features, or that have functionality with visual dependencies (e.g. via CSS positions and so forth). So whilst the pure JavaScript support in these browsers is generally complete, the actual supported browser functionality should be considered as partial only.
(Note: The original version of this post only mentioned HtmlUnit, hence the comments. If you know of other headless browser implementations and have edit rights, feel free to edit this post and add them.)
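As a minimal sketch of the Selenium WebDriver route from the list above: this assumes the selenium Python package and a local Chrome/chromedriver installation, and the option name is from recent Selenium releases, so it may differ in older ones.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://www.example.com/")
    print(driver.title)  # full JavaScript/DOM support, since a real browser renders the page
finally:
    driver.quit()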
Check out twill, a very convenient scripting language for precisely what you're looking for. From the examples:
setlocal username <your username>
setlocal password <your password>
go http://www.slashdot.org/
formvalue 1 unickname $username
formvalue 1 upasswd $password
submit
code 200 # make sure form submission is correct!
There's also a Python API if you're looking for more flexibility.
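For instance, a rough equivalent of the script above through the Python API might look like this (a sketch assuming the twill package is installed; the command functions mirror the script commands):
from twill.commands import go, formvalue, submit, code

go("http://www.slashdot.org/")
formvalue("1", "unickname", "your username")
formvalue("1", "upasswd", "your password")
submit()
code(200)  # make sure the form submission succeeded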
Have a look at PhantomJS, a JavaScript-based automation framework available for Windows, Mac OS X, Linux, and other *nix systems.
Using PhantomJS, you can do things like this:
console.log('Loading a web page');

var page = new WebPage();
var url = "http://www.phantomjs.org/";

page.open(url, function (status) {
    // perform your task once the page is ready ...
    phantom.exit();
});
Or evaluate a page's title:
var page = require('webpage').create();
var url = "http://www.phantomjs.org/";

page.open(url, function (status) {
    var title = page.evaluate(function () {
        return document.title;
    });
    console.log('Page title is ' + title);
    phantom.exit();
});
Examples from PhantomJS' Quickstart page. You can even render a page to a PNG, JPEG or PDF using the render() method.
I once did that using the Internet Explorer ActiveX control (WebBrowser, MSHTML). You can instantiate it without making it visible.
This can be done with any language which supports COM (Delphi, VB6, VB.net, C#, C++, ...).
Of course this is a quick-and-dirty solution and might not be appropriate in your situation.
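As a rough illustration of that COM approach from Python (a sketch assuming the pywin32 package on Windows; the InternetExplorer.Application object is the same one the other COM-capable languages would drive):
import time
import win32com.client

# Drive the Internet Explorer COM object without ever showing a window.
ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = False
ie.Navigate("http://www.example.com/")

# Crude wait until the document has finished loading (4 == READYSTATE_COMPLETE).
while ie.Busy or ie.ReadyState != 4:
    time.sleep(0.1)

print(ie.Document.title)
ie.Quit()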
PhantomJS is a headless WebKit-based browser that you can script with JavaScript.
Except for the auto-download of the file (as that is a dialog box), a WinForms app with the embedded WebBrowser control will do this.
You could look at WatiN and WatiN Recorder. They may help with C# code that can log in to your website, navigate to a URL and possibly even help automate the file download.
YMMV though.
If the links are known (e.g. you don't have to search the page for them), then you can probably use wget. I believe it will do the state management across multiple fetches.
If you are a little more enterprising, then I would delve into the new goodies in Python 3.0. They redid the interface to their HTTP stack and, IMHO, have a very nice interface that lends itself to this type of scripting.
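As a sketch of what that looks like with the Python 3 standard library (the URL and form field names here are placeholders):
import urllib.request
import urllib.parse
import http.cookiejar

# A cookie jar gives you the session-state management across fetches.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Log in once; the session cookie lands in the jar automatically.
login_data = urllib.parse.urlencode({"user": "username", "pass": "password"}).encode()
opener.open("http://www.example.com/login", login_data)

# Later fetches through the same opener reuse the session.
with opener.open("http://www.example.com/files/report.pdf") as response:
    with open("report.pdf", "wb") as out:
        out.write(response.read())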
Node.js with YUI on the server. Check out this video: http://www.yuiblog.com/blog/2010/09/29/video-glass-node/
The guy in this video, Dav Glass, shows an example of how he uses Node to fetch a page from Digg. He then attaches YUI to the DOM he grabbed and can completely manipulate it.
Also, you can use Live HTTP Headers (a Firefox extension) to record the headers which are sent to the site (Login -> Links -> Download Link) and then replicate them with PHP using fsockopen. The only thing you'll probably need to vary is the cookie value which you receive from the login page.
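The same replay idea, sketched in Python rather than PHP (the host, path and header values are placeholders you would copy from the Live HTTP Headers capture, swapping in the fresh session cookie):
import http.client

# Replay the recorded request with the headers the browser sent.
conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/download/file.zip", headers={
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://www.example.com/links",
    "Cookie": "PHPSESSID=abc123",  # the value that changes with each login
})
response = conn.getresponse()
with open("file.zip", "wb") as out:
    out.write(response.read())
conn.close()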
Can you not just use a download manager?
There are better ones, but FlashGet has browser integration and supports authentication. You can log in, click a bunch of links, queue them up and schedule the download.
You could write something that, say, acts as a proxy which catches specific links and queues them for later download, or a JavaScript bookmarklet that modifies links to go to "http://localhost:1234/download_queuer?url=" + $link.href and have that queue the downloads - but you'd be reinventing the download-manager wheel, and with authentication it can be more complicated.
Or, if you want the "login, click links" bit to be automated also, look into screen scraping: basically you load the page via an HTTP library, find the download links and download them.
A slightly simplified example, using Python:
import urllib
from BeautifulSoup import BeautifulSoup

# Fetch the page, passing HTTP basic auth credentials in the URL (Python 2).
src = urllib.urlopen("http://%s:%s@example.com" % ("username", "password"))
soup = BeautifulSoup(src)

# Only consider anchors that actually carry an href attribute.
for link_tag in soup.findAll("a", href=True):
    link = link_tag["href"]
    filename = link.split("/")[-1]  # get everything after the last /
    urllib.urlretrieve(link, filename)
That would download every link on example.com, after authenticating with the username/password of "username" and "password". You could, of course, find more specific links using BeautifulSoup's HTML selectors (for example, you could find all links with the class "download", or URLs that start with http://cdn.example.com).
You could do the same in pretty much any language.
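For instance, the filtering mentioned above might look like this (same old-style BeautifulSoup import as the example; the class name and CDN host are placeholders):
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<a class="download" href="http://cdn.example.com/file.zip">file</a>
<a href="http://example.com/about">about</a>
""")

# All anchors carrying class="download".
download_links = soup.findAll("a", attrs={"class": "download"})

# All anchors whose href starts with the CDN host.
cdn_links = soup.findAll("a", href=re.compile(r"^http://cdn\.example\.com"))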
.NET contains System.Windows.Forms.WebBrowser. You can create an instance of this, send it to a URL, and then easily parse the HTML on that page. You could then follow any links you found, etc.
I have worked with this object only minimally, so I'm no expert, but if you're already familiar with .NET then it would probably be worth looking into.