Web Scraping with HttpClient in C#
Retrieve and use data from the internet with HttpClient
This e-course will teach you how to use HttpClient to retrieve information from the internet and use it in your own application. We will look at what HttpClient is, how to perform POST and GET requests, how to parse HTML with the HTML Agility Pack, and how to store the information.
FREE
€ 5,99
This course includes:
8 Hours
Of self-paced video lessons
45 days access
Enjoy this e-course for 1 1/2 months!
(give or take)
45 days access to Discord
Talk with other participants about this e-course, ask questions, and help each other.
Skills you will learn
- Usage of HttpClient
- HTML Agility Pack
- Parallel Scraping
- Debugging and Troubleshooting
This e-course is for
- Software Developers
- Fullstack Developers
- Anyone working with data
Topics In This Course
Introduction to web scraping
Usage of HttpClient
Using the HttpClientFactory
HTML Agility Pack
XPath
Strategy Pattern
Concurrency throttling mechanism
HTTP request resilience
SQLite databases
Dependency injection
Chapters Of This Course
Introduction to this course.
Let’s set some goals for what we are going to learn.
What is web scraping and why do we use it? A small introduction to web scraping before we begin.
Before we start this course, we need to talk about legal and ethical issues. You can’t go scraping all the websites you want. There are rules, terms of service, and even laws.
We are going to make a simple class that holds all the information we will scrape from the websites. Each time we get information from an online source, we need to store it somewhere. For now, that is in-memory.
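A minimal sketch of what such a model class could look like (the class and property names here are illustrative assumptions, not necessarily the ones used in the course):

```csharp
using System;

// A simple in-memory model for a scraped blog post.
// Class and property names are illustrative assumptions.
public class BlogPost
{
    public string Title { get; set; } = string.Empty;
    public string Url { get; set; } = string.Empty;
    public DateTime? PublishedOn { get; set; }
}
```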
The very first thing we need to do is to grab the HTML from the online source we want to scrape. I will start with the page that shows all the blogs on my website.
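A minimal sketch of that first GET request (the URL is a placeholder; point it at the page you want to scrape):

```csharp
using System;
using System.Net.Http;

// Fetch the raw HTML of a page with HttpClient.
using var client = new HttpClient();
string html = await client.GetStringAsync("https://example.com/blogs");

Console.WriteLine($"Downloaded {html.Length} characters"); // quick sanity check
```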
In the previous chapter, we retrieved the HTML. Now we need to inspect that HTML and extract only the information we need. We do this with the HTML Agility Pack package, which is built to inspect HTML with XPath.
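A minimal sketch of that extraction step. The XPath expression and URL are assumptions; adjust them to the structure of the page you actually scrape:

```csharp
using System;
using System.Net.Http;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

using var client = new HttpClient();
string html = await client.GetStringAsync("https://example.com/blogs");

// Load the HTML and query it with XPath.
var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so check before looping.
var links = doc.DocumentNode.SelectNodes("//article//h2/a");
if (links != null)
{
    foreach (var link in links)
    {
        Console.WriteLine(link.InnerText.Trim());
    }
}
```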
We now have the data from the website. Let’s show this information on the screen.
Now that we know the basics of building one scraper, let’s add another one for a different website. The goal is to build only the new scraper, not change the whole application. Both scrapers (the previous one and the new one) should work alongside each other.
The strategy pattern lets us switch between different algorithms without changing the calling code. Ideal if you have multiple scrapers and don’t want huge if-statements.
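A minimal sketch of the pattern (the interface and class names are illustrative assumptions):

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// One shared interface; every website gets its own implementation.
public interface IScraper
{
    Task<IReadOnlyList<string>> ScrapeTitlesAsync(HttpClient client);
}

// A scraper for one specific website. Adding a new website means
// adding a new class, not touching the existing ones.
public class MyBlogScraper : IScraper
{
    public async Task<IReadOnlyList<string>> ScrapeTitlesAsync(HttpClient client)
    {
        string html = await client.GetStringAsync("https://example.com/blogs");
        // parse 'html' with the HTML Agility Pack here...
        return new List<string>();
    }
}
```

The caller then simply loops over a collection of IScraper instances instead of branching per website.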
Request headers carry vital information alongside the HTTP method (POST, GET, PUT, DELETE): authentication/authorization, the content type, and much more. In this chapter, we will add the User-Agent header, since many websites require it.
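A minimal sketch of setting that header (the header value is an example; pick one that identifies your scraper):

```csharp
using System.Net.Http;

using var client = new HttpClient();

// Set the User-Agent once on the client so every request carries it.
client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; MyScraper/1.0)");

string html = await client.GetStringAsync("https://example.com/blogs");
```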
If you scrape a website with multiple pages, you launch a lot of requests at that one website and/or server. At some point, it is going to block you. To avoid this, we are going to build a mechanism that restricts the number of requests within a specific time window: a throttling system that adds delays between the requests.
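One possible sketch of such a throttle, using SemaphoreSlim to cap concurrency and Task.Delay to space requests out. The limits here are assumptions; tune them per target site:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var gate = new SemaphoreSlim(2);        // at most 2 requests in flight
var spacing = TimeSpan.FromSeconds(1);  // minimum delay before each request

// Wrap every page request in this helper to stay under the limit.
async Task<string> FetchThrottledAsync(HttpClient client, string url)
{
    await gate.WaitAsync();
    try
    {
        await Task.Delay(spacing);              // spread the requests out
        return await client.GetStringAsync(url);
    }
    finally
    {
        gate.Release();
    }
}

using var client = new HttpClient();
string html = await FetchThrottledAsync(client, "https://example.com/blogs");
```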
We are going to add a retry mechanism to our code: a retry counter that checks the response and retries the request when the status code is 429 (Too Many Requests), repeating until the maximum number of retries is reached.
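A minimal sketch of that routine (the retry count and backoff are assumptions):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

async Task<string> GetWithRetryAsync(HttpClient client, string url, int maxRetries = 3)
{
    for (int attempt = 0; attempt <= maxRetries; attempt++)
    {
        var response = await client.GetAsync(url);

        // Anything other than 429 is either success or a different problem.
        if (response.StatusCode != HttpStatusCode.TooManyRequests)
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }

        // Rate limited: wait a bit longer on every attempt, then try again.
        await Task.Delay(TimeSpan.FromSeconds(2 * (attempt + 1)));
    }
    throw new HttpRequestException($"Still rate limited after {maxRetries} retries: {url}");
}

using var client = new HttpClient();
string html = await GetWithRetryAsync(client, "https://example.com/blogs");
```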
Grabbing the information from the web pages is cool, but storing the data somewhere would be convenient. Let’s store some information in a SQLite database and use that stored information in our application.
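A minimal sketch of writing a scraped record to SQLite with the Microsoft.Data.Sqlite package (the table and column names are illustrative assumptions):

```csharp
using Microsoft.Data.Sqlite; // NuGet package: Microsoft.Data.Sqlite

using var connection = new SqliteConnection("Data Source=scraper.db");
connection.Open();

// Create the table on first run.
var create = connection.CreateCommand();
create.CommandText = "CREATE TABLE IF NOT EXISTS Posts (Title TEXT, Url TEXT)";
create.ExecuteNonQuery();

// Insert a scraped post with parameters (never string-concatenate SQL).
var insert = connection.CreateCommand();
insert.CommandText = "INSERT INTO Posts (Title, Url) VALUES ($title, $url)";
insert.Parameters.AddWithValue("$title", "Example post");
insert.Parameters.AddWithValue("$url", "https://example.com/blogs/1");
insert.ExecuteNonQuery();
```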
Time to make some performance improvements. The code works, no worries there, but there are a few parts we can do better. In this chapter, we will look at socket exhaustion, pipelining, and the HttpClientFactory.
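A minimal console-app sketch of wiring up the factory, assuming the Microsoft.Extensions.Http NuGet package:

```csharp
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection; // NuGet: Microsoft.Extensions.Http

// The factory pools and reuses message handlers, which is what
// prevents socket exhaustion when you create many clients.
var services = new ServiceCollection();
services.AddHttpClient();

using var provider = services.BuildServiceProvider();
var factory = provider.GetRequiredService<IHttpClientFactory>();

HttpClient client = factory.CreateClient();
string html = await client.GetStringAsync("https://example.com/blogs");
```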
A small recap and some small notes after the course.
Subscribe For Our Newsletter
Stay up to date with news, deals, new courses, and much more!
Frequently Asked Questions
An e-course is a digital course you can follow or take.
An e-course is usually a written course with information on the subject you want to learn. It contains examples (code, images, graphs) and explanations.
E-courses are not live and you can start, pause, continue, and stop whenever you want.
It’s not only text and examples: you also test your knowledge with a quiz at the end of a chapter*. This is done with Kens Learning Paths, a dedicated testing platform where you can check whether you have mastered the material.
To take the e-course, you first need to create an account. Don’t worry, not much information is needed. With your account, you get your dashboard.
Once you have registered for an e-course, the e-course is added to your dashboard.
Start or continue an e-course from your dashboard.
It depends on your speed. You can go through the e-course when and how you want.
But if you went berserk on it, you could finish it in two days.
Yes and no.
The yes: You will be added to the Discord server, where you can ask other participants questions, and sometimes a teacher will be online too.
The no: E-mails sent to us with questions about the subject are not answered. This is done to keep questions centralized on Discord.
But… if you have a problem with the e-course (a bug, access problems, stuff like that), we would like you to send an e-mail or post it in our support Discord channel.
Currently not, but it is planned for the future. If you finish an e-course and stick around, you will get a certificate when it’s available.
You are allowed 45 days for this e-course. It’s not possible to extend this.
* = Some chapters do not include Kens Learning Paths, either because the chapter doesn’t need one or because Kens Learning Paths is not ready for it.