How to programmatically log in to a website to screen-scrape? [web-scraping]
You'd make the request as though you'd just filled out the form. Assuming it's POST, for example, you make a POST request with the correct data. Now if you can't log in directly to the same page you want to scrape, you will have to track whatever cookies are set after your login request, and include them in your scraping request so you stay logged in.
It might look like:
HttpWebRequest http = WebRequest.Create(url) as HttpWebRequest;
http.KeepAlive = true;
http.Method = "POST";
http.ContentType = "application/x-www-form-urlencoded";
string postData = "FormNameForUserId=" + strUserId + "&FormNameForPassword=" + strPassword;
byte[] dataBytes = Encoding.UTF8.GetBytes(postData);
http.ContentLength = dataBytes.Length;
using (Stream postStream = http.GetRequestStream())
{
    postStream.Write(dataBytes, 0, dataBytes.Length);
}
HttpWebResponse httpResponse = http.GetResponse() as HttpWebResponse;
// Probably want to inspect httpResponse.Headers here first
http = WebRequest.Create(url2) as HttpWebRequest;
http.CookieContainer = new CookieContainer();
http.CookieContainer.Add(httpResponse.Cookies);
HttpWebResponse httpResponse2 = http.GetResponse() as HttpWebResponse;
Maybe.
You can use a WebBrowser control. Just feed it the URL of the site, then use the DOM to set the username and password into the right fields, and eventually send a click to the submit button. This way you don't care about anything but the two input fields and the submit button. No cookie handling, no raw HTML parsing, no HTTP sniffing - all that is done by the browser control.
If you go that way, a few more suggestions:
- You can prevent the control from loading add-ins such as Flash - could save you some time.
- Once you login, you can obtain whatever information you need from the DOM - no need to parse raw HTML.
- If you want to make the tool more resilient in case the site changes in the future, you can replace your explicit DOM manipulation with an injection of JavaScript. The JS can be obtained from an external resource, and once called it can populate the fields and submit the form.
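The WebBrowser approach above might be sketched like this (assumes a WinForms project running on an STA thread with a message loop; the element ids "username", "password" and "loginButton" are placeholders - substitute the real ids from the target page's HTML):

```csharp
using System;
using System.Windows.Forms;

public class LoginScraper
{
    public static void Run(string loginUrl, string user, string pass)
    {
        var browser = new WebBrowser();
        browser.ScriptErrorsSuppressed = true;
        browser.DocumentCompleted += (s, e) =>
        {
            HtmlDocument doc = browser.Document;
            if (doc == null) return;
            // Fill the two fields and click submit via the DOM.
            doc.GetElementById("username")?.SetAttribute("value", user);
            doc.GetElementById("password")?.SetAttribute("value", pass);
            doc.GetElementById("loginButton")?.InvokeMember("click");
            // The next DocumentCompleted event fires with the logged-in page,
            // whose DOM you can then read directly - no HTML parsing needed.
        };
        browser.Navigate(loginUrl);
    }
}
```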
In some cases, httpResponse.Cookies will be blank. Use a CookieContainer instead.
CookieContainer cc = new CookieContainer();
HttpWebRequest http = WebRequest.Create(url) as HttpWebRequest;
http.KeepAlive = true;
http.Method = "POST";
http.ContentType = "application/x-www-form-urlencoded";
http.CookieContainer = cc;
string postData = "FormNameForUserId=" + strUserId + "&FormNameForPassword=" + strPassword;
byte[] dataBytes = Encoding.UTF8.GetBytes(postData);
http.ContentLength = dataBytes.Length;
using (Stream postStream = http.GetRequestStream())
{
    postStream.Write(dataBytes, 0, dataBytes.Length);
}
HttpWebResponse httpResponse = http.GetResponse() as HttpWebResponse;
// Probably want to inspect httpResponse.Headers here first
http = WebRequest.Create(url2) as HttpWebRequest;
http.CookieContainer = cc;
HttpWebResponse httpResponse2 = http.GetResponse() as HttpWebResponse;
As an addition to dlambin's answer, it is necessary to have
http.AllowAutoRedirect=false;
Otherwise the call
HttpWebResponse httpResponse = http.GetResponse() as HttpWebResponse;
will follow the redirect and make another request to the initial URL, and you won't be able to retrieve url2.
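With the redirect disabled, you can read the Location header from the redirect response yourself before making the follow-up request. A sketch (it reuses the http request and cc cookie container from the example above; the 302 Found status is an assumption - some sites respond with 303 instead):

```csharp
http.AllowAutoRedirect = false;
using (var response = (HttpWebResponse)http.GetResponse())
{
    // The login POST typically answers with a redirect; since auto-redirect
    // is off, we see that response as-is and can inspect it.
    if (response.StatusCode == HttpStatusCode.Found)   // 302
    {
        string redirectTo = response.Headers["Location"];
        // ...create the next HttpWebRequest for redirectTo, reusing cc
        // so the session cookies set by the login travel with it...
    }
}
```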
You need to use HttpWebRequest and do a POST. These links should help you get started. The key is that you need to look at the HTML form of the page you're trying to post from to see all the parameters the form needs in order to submit the POST.
http://www.netomatix.com/httppostdata.aspx
http://geekswithblogs.net/rakker/archive/2006/04/21/76044.aspx
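Many login forms also carry hidden inputs (for example __VIEWSTATE on ASP.NET pages) that must be echoed back in the POST body. A hypothetical helper to pull them out of the fetched login page with a regex - a rough sketch that assumes the name attribute appears before value in each input tag; an HTML parser is more robust:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormFields
{
    // Extract every <input name="..." value="..."> pair from the page HTML
    // so hidden fields can be appended to the POST data alongside the
    // username and password fields.
    public static Dictionary<string, string> Extract(string html)
    {
        var fields = new Dictionary<string, string>();
        var re = new Regex(
            "<input[^>]*name=\"(?<n>[^\"]*)\"[^>]*value=\"(?<v>[^\"]*)\"",
            RegexOptions.IgnoreCase);
        foreach (Match m in re.Matches(html))
            fields[m.Groups["n"].Value] = m.Groups["v"].Value;
        return fields;
    }
}
```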