Downloading pdf file using WebRequests

Asked 10/8, 2012 at 12:9 Answered 27/12, 2017 at 7:24

I'm trying to download a number of pdf files automagically given a list of urls.

Here's the code I have:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

request.Method = "GET";

var encoding = new UTF8Encoding();

request.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-gb,en;q=0.5");
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate");

request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";

HttpWebResponse resp = (HttpWebResponse)request.GetResponse();

BinaryReader reader = new BinaryReader(resp.GetResponseStream());

FileStream stream = new FileStream("output/" + date.ToString("yyyy-MM-dd") + ".pdf",FileMode.Create);

BinaryWriter writer = new BinaryWriter(stream);

while (reader.PeekChar() != -1)
{
    writer.Write(reader.Read());
}
writer.Flush();
writer.Close();

So, I know the first part works. I was originally getting it and reading it using a TextReader - but that gave me corrupted pdf files (since pdfs are binary files).

Right now if I run it, reader.PeekChar() is always -1 and nothing happens - I get an empty file.

While debugging it, I noticed that reader.Read() was actually giving different numbers when I was invoking it - so maybe Peek is broken.

So I tried something very dirty:

try
{
    while (true)
    {
        writer.Write(reader.Read());
    }
}
catch
{
}
writer.Flush();
writer.Close();

Now I'm getting a very tiny file with some garbage in it, but it's still not what I'm looking for.

So, can anyone point me in the right direction?

Additional Information:

The response headers don't suggest it's compressed or anything else.

HTTP/1.1 200 OK
Content-Type: application/pdf
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Fri, 10 Aug 2012 11:15:48 GMT
Content-Length: 109809

Displacement asked 10/8, 2012 at 12:9 Comment(0)

Skip the BinaryReader and BinaryWriter and just copy the input stream to the output FileStream. Briefly:

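var fileName = "output/" + date.ToString("yyyy-MM-dd") + ".pdf";
// Dispose the response stream as well as the file once the copy is done.
using (var responseStream = resp.GetResponseStream())
using (var stream = File.Create(fileName))
    responseStream.CopyTo(stream);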
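
For what it's worth, the reason the original loop misbehaves can be reproduced without any network access. BinaryReader.Read() decodes a character and returns it as an int, so writer.Write(reader.Read()) picks the Write(int) overload and emits four bytes per value. PeekChar() is documented to return -1 when the stream does not support seeking, which is the case for a response stream. The sketch below uses in-memory streams only; CopyDemo and its method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class CopyDemo
{
    // Mimics the loop from the question: Read() returns a decoded character
    // as an int, and Write(int) then writes four bytes for each of them.
    public static byte[] CopyWithReadInt(byte[] source)
    {
        using (var input = new MemoryStream(source))
        using (var reader = new BinaryReader(input))
        using (var output = new MemoryStream())
        using (var writer = new BinaryWriter(output))
        {
            int value;
            while ((value = reader.Read()) != -1)
            {
                writer.Write(value);
            }
            writer.Flush();
            return output.ToArray();
        }
    }

    // Byte-for-byte copy, as suggested in the answer above.
    public static byte[] CopyWithCopyTo(byte[] source)
    {
        using (var input = new MemoryStream(source))
        using (var output = new MemoryStream())
        {
            input.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Feeding both methods the same ASCII bytes should show the first result coming back four times as long as the input, while the CopyTo result matches it exactly.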

Lecythus answered 10/8, 2012 at 12:11 Comment(4)

I wonder if there is a way to get this into a byte array instead of sending it to the file system? – Juliusjullundur 24/8, 2015 at 20:12

@ioSamurai: Replace File.Create(fileName) with new MemoryStream() and then at the end of the using block retrieve the bytes: var bytes = stream.ToArray(). A MemoryStream does not use any unmanaged resources so you can also drop the using block entirely. – Lecythus 24/8, 2015 at 20:33

@MartinLiversage hmm I have tried this a few times and while I do get a byte stream, when I ultimately write it to disk the pdf file is corrupt... however making the same request from the browser (I am using WebRequest in code) gives the PDF file fine. This may actually be some strange behavior related to how Report Server serves up PDF responses to web requests... – Juliusjullundur 24/8, 2015 at 20:40

@ioSamurai: I am pretty sure that the few lines of code I have provided do not corrupt a PDF file, and I would be surprised if Report Server has a "strange behavior". To troubleshoot, you can compare the first few bytes and the length of the file obtained three ways: with your own code, with a tool like Fiddler to see the stream in transit, and with a web browser. – Lecythus 24/8, 2015 at 20:54
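
The MemoryStream suggestion and the "compare the first few bytes" check from the comments above could be sketched roughly as follows. PdfBytes, ReadAllBytes and LooksLikePdf are hypothetical helper names, not part of any library; the check relies on the fact that a PDF file normally begins with the ASCII marker "%PDF-".

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfBytes
{
    // Reads any stream (for example resp.GetResponseStream()) fully into memory.
    public static byte[] ReadAllBytes(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Returns true when the data begins with the ASCII marker "%PDF-".
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < marker.Length)
        {
            return false;
        }
        for (int i = 0; i < marker.Length; i++)
        {
            if (data[i] != marker[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

If LooksLikePdf returns false, or the array length differs from the Content-Length header, the bytes were probably altered before they reached this code. One possible cause, offered only as a guess, is sending an Accept-Encoding header without setting request.AutomaticDecompression, in which case a server may reply with compressed bytes.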

Why not use the WebClient class?

using (WebClient webClient = new WebClient())
{
    webClient.DownloadFile("url", "filePath");
}

Orissa answered 10/8, 2012 at 12:12 Comment(2)

I needed to be able to change the request headers. – Displacement 10/8, 2012 at 12:17

@Aabela, yeah, please take a look at WebClient.Headers Property. – Orissa 10/8, 2012 at 12:19
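
As a rough illustration of the WebClient.Headers suggestion, the headers from the question can be set on the client before downloading. This is a minimal sketch: HeaderedDownload, CreateClient and Download are hypothetical names, and the header values are simply copied from the question.

```csharp
using System;
using System.Net;

public static class HeaderedDownload
{
    // Builds a WebClient carrying the request headers used in the question.
    public static WebClient CreateClient()
    {
        var client = new WebClient();
        client.Headers[HttpRequestHeader.AcceptLanguage] = "en-gb,en;q=0.5";
        client.Headers[HttpRequestHeader.Accept] =
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        client.Headers[HttpRequestHeader.UserAgent] =
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";
        return client;
    }

    // Downloads url to filePath using the headers above.
    public static void Download(string url, string filePath)
    {
        using (var client = CreateClient())
        {
            client.DownloadFile(url, filePath);
        }
    }
}
```

The Accept-Encoding header is left out on purpose here, since WebClient is not expected to decompress a gzip or deflate response on its own.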

Your question asks about WebClient, but your code shows you using raw HTTP requests and responses.

Why don't you actually use System.Net.WebClient?

using (System.Net.WebClient wc = new System.Net.WebClient())
{
    wc.DownloadFile("http://www.site.com/file.pdf",  "C:\\Temp\\File.pdf");
}

Dwelt answered 10/8, 2012 at 12:12 Comment(2)

Sorry, fixed original question. The reason I went for raw HTTP requests/response is because I need to modify the headers myself. – Displacement 10/8, 2012 at 12:16

yep. it does that too. just saw your comment below. live and learn :-) – Dwelt 10/8, 2012 at 12:27

// Requires: using System; using System.ComponentModel; using System.Net; using System.Windows.Forms;
private void Form1_Load(object sender, EventArgs e)
{
    WebClient webClient = new WebClient();
    webClient.DownloadFileCompleted += new AsyncCompletedEventHandler(Completed);
    webClient.DownloadProgressChanged += new DownloadProgressChangedEventHandler(ProgressChanged);
    // DownloadFileAsync takes the source URI and the destination file path.
    webClient.DownloadFileAsync(new Uri("https://www.colorado.gov/pacific/sites/default/files/Income1.pdf"), @"output/" + DateTime.Now.Ticks + ".pdf");
}

private void ProgressChanged(object sender, DownloadProgressChangedEventArgs e)
{
    // progressBar is assumed to be a ProgressBar control on the form.
    progressBar.Value = e.ProgressPercentage;
}

private void Completed(object sender, AsyncCompletedEventArgs e)
{
    MessageBox.Show("Download completed!");
}

Reichstag answered 27/12, 2017 at 7:24 Comment(0)
