Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)
Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
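A minimal sketch of that request-then-fallback flow might look like the following. The names documentFromXhr and fetchDocument are hypothetical, and the support check here is a simpler swap-in (it only tests whether the responseType assignment stuck), not the blob-based detection from the gist. It also assumes the browser's DOMParser accepts text/html, which is what the patch mentioned above provides on older browsers.

```javascript
// Hypothetical helper: return a parsed document from a finished XHR.
// xhr        - a completed XMLHttpRequest (or an object shaped like one)
// ParserCtor - the DOMParser constructor, passed in so it can be swapped
function documentFromXhr(xhr, ParserCtor) {
  // If the browser accepted responseType = "document", the parsed result
  // is already on xhr.response (it can be null when parsing failed).
  if (xhr.responseType === "document") {
    return xhr.response;
  }
  // Otherwise the assignment was ignored and we parse the text ourselves.
  // Drop parameters such as "; charset=utf-8", because parseFromString
  // only accepts a bare MIME type.
  var header = xhr.getResponseHeader("Content-Type") || "text/html";
  var mime = header.split(";")[0].trim();
  return new ParserCtor().parseFromString(xhr.responseText, mime);
}

// Hypothetical wrapper: fetch a URL and hand a document to onDone.
function fetchDocument(url, onDone, onError) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);
  try {
    xhr.responseType = "document"; // silently ignored where unsupported
  } catch (e) {
    // Very old implementations may throw; the fallback path handles it.
  }
  xhr.onload = function () {
    onDone(documentFromXhr(xhr, DOMParser));
  };
  xhr.onerror = function () {
    onError(new Error("Request failed: " + url));
  };
  xhr.send();
}
```

From an extension page, a call such as fetchDocument(url, function (doc) { console.log(doc.title); }, console.error) would then work for any host the extension has permission to request.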
Use the Chrome WebRequest API to hide X-Requested-With and similar headers.
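As a rough illustration, header hiding with the blocking form of chrome.webRequest.onBeforeSendHeaders could be sketched like this. HIDDEN_HEADERS and stripHeaders are hypothetical names, the `<all_urls>` filter is only a placeholder, and the sketch assumes a Manifest V2 style extension that declares the "webRequest" and "webRequestBlocking" permissions plus matching host permissions.

```javascript
// Header names to hide, lower-cased. Extend the list as needed (assumption).
var HIDDEN_HEADERS = ["x-requested-with"];

// Hypothetical helper: return a copy of the header list without the
// headers whose names appear in `hidden` (case-insensitive match).
function stripHeaders(requestHeaders, hidden) {
  return requestHeaders.filter(function (header) {
    return hidden.indexOf(header.name.toLowerCase()) === -1;
  });
}

// Only wire up the listener when running inside an extension.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      // Returning requestHeaders replaces the headers that get sent.
      return {
        requestHeaders: stripHeaders(details.requestHeaders, HIDDEN_HEADERS)
      };
    },
    { urls: ["<all_urls>"] },        // placeholder filter; narrow it in practice
    ["blocking", "requestHeaders"]   // needed to read and modify headers
  );
}
```

Narrowing the URL filter to the sites being scraped keeps the listener from touching unrelated traffic.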
If you are fine looking at something beyond a Google Chrome plugin, look at PhantomJS, which uses QtWebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser, as it doesn't display the output on a screen and can quietly work in the background while you are doing other stuff. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking on buttons, etc., much like you have in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. As it uses WebKit, its rendering behaviour is very close to that of Google Chrome.
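A small PhantomJS script along those lines might look like the sketch below. The target URL, the output file name, and the extractLinks helper are placeholders of my own, and it uses plain DOM calls instead of an injected jQuery to keep the example short.

```javascript
// Hypothetical extractor. PhantomJS runs it inside the loaded page, where
// no argument is passed and the page's own `document` is used.
function extractLinks(doc) {
  doc = doc || document;
  return Array.prototype.map.call(doc.querySelectorAll("a[href]"), function (a) {
    return { text: a.textContent.trim(), href: a.href };
  });
}

// The block below only runs under PhantomJS (e.g. `phantomjs scrape.js`).
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();
  page.open("https://example.com/", function (status) {
    if (status !== "success") {
      console.log("Failed to load page");
      phantom.exit(1);
      return;
    }
    page.render("example.png");                 // export a screenshot
    var links = page.evaluate(extractLinks);    // runs in the page context
    console.log(JSON.stringify(links, null, 2));
    phantom.exit(0);
  });
}
```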
Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.
A lot of tools have been released since this question was asked.
artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.
Web scraping is kind of convoluted in a Chrome Extension. Some points:
- You run content scripts for access to the DOM.
- Background pages (one per browser) can exchange messages with content scripts. That is, you can run a content script that sets up an RPC endpoint and fires a specified callback in the context of the background page as a response.
- You can execute content scripts in all frames of a webpage, then stitch the document tree (composed of the 1..N frames that the page contains) together.
- As S.K. suggested, your background page can send the data as an XMLHttpRequest to some kind of lightweight HTTP server that listens locally.
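The first three points could be sketched roughly as follows. All function names and the "scrape" message type are hypothetical. The sketch assumes the content script is declared with "all_frames": true and that the extension has the "webNavigation" permission, which chrome.webNavigation.getAllFrames needs.

```javascript
// Content-script side: build the reply for a "scrape" request.
function handleScrapeRequest(message, doc) {
  if (!message || message.type !== "scrape") return null;
  return { url: doc.location.href, html: doc.documentElement.outerHTML };
}

// Content-script side: expose the RPC endpoint. Call once per frame.
function installContentListener() {
  chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    var reply = handleScrapeRequest(message, document);
    if (reply) sendResponse(reply);
  });
}

// Background side: turn flat per-frame results into a tree.
// The top frame has parentFrameId -1; frames whose parent did not answer
// are dropped in this sketch.
function stitchFrames(frames) {
  var byId = {};
  frames.forEach(function (f) {
    byId[f.frameId] = { frameId: f.frameId, url: f.url, html: f.html, children: [] };
  });
  var root = null;
  frames.forEach(function (f) {
    if (f.parentFrameId === -1) {
      root = byId[f.frameId];
    } else if (byId[f.parentFrameId]) {
      byId[f.parentFrameId].children.push(byId[f.frameId]);
    }
  });
  return root;
}

// Background side: ask every frame of a tab for its markup, then stitch.
function scrapeTab(tabId, onDone) {
  chrome.webNavigation.getAllFrames({ tabId: tabId }, function (frames) {
    frames = frames || [];
    var results = [];
    var pending = frames.length;
    if (pending === 0) { onDone(null); return; }
    frames.forEach(function (frame) {
      chrome.tabs.sendMessage(tabId, { type: "scrape" }, { frameId: frame.frameId },
        function (reply) {
          void chrome.runtime.lastError; // read it so Chrome does not log a warning
          if (reply) {
            results.push({
              frameId: frame.frameId,
              parentFrameId: frame.parentFrameId,
              url: reply.url,
              html: reply.html
            });
          }
          pending -= 1;
          if (pending === 0) onDone(stitchFrames(results));
        });
    });
  });
}
```

The content script would call installContentListener() when it loads, and the background page would call scrapeTab(tabId, callback) to receive the stitched tree, which it could then post to a local server as in the last point.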
I'm not sure it's entirely possible with just JavaScript, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML for a page, the PHP script could scrape the page for you and your extension could read it in through an AJAX request.
The actual page being scraped wouldn't know it's an AJAX request, however, because it is being accessed through cURL.
I think you can start from this example.
So basically you can try using an Extension + Plugin combination. The Extension would have access to the DOM (including the plugin) and drive the process, and the Plugin would send the actual HTTP requests.
I can recommend using FireBreath as a cross-platform Chrome/Firefox plugin platform; in particular, take a look at this example: FireBreath - Making HTTP Requests with SimpleStreamsHelper
Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can do your jQuery selections, no?