Parsing an HTML document in .NET

You have probably wondered more than once how to pull information out of an HTML document quickly and without too much effort. Unclosed tags and unquoted attribute values are the norm on most pages. XHTML solves some of these problems, but fully validated pages are rare as well. Despite the similarities, most HTML pages cannot be treated as XML documents. The .NET platform does not provide tools for parsing HTML documents, so we have to rely on external libraries.

In this post I would like to present a library I came across some time ago and that has made my work easier more than once: Html Agility Pack. Its main advantage is that it lets us move around an HTML document as if it were an XML document. To select elements of the document we can use XPath or LINQ (since version 1.4.0). Using the library is fairly simple, and I will briefly show how below.

The first thing we have to do in order to use the library is add a reference to HtmlAgilityPack.dll to the project. After that we can use the types the library provides.

The class that represents our HTML document is HtmlDocument. We create an object of this class with the default constructor.

[csharp]
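// Requires: using System.Net; using HtmlAgilityPack;
// Download the page source as a string.
WebClient client = new WebClient();
string html = client.DownloadString("http://blog.pietowski.com");

// Create an empty document and parse the downloaded markup.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);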
[/csharp]

We can load an HTML document with the Load method, which accepts a stream, a TextReader object, or a file path. The alternative is the LoadHtml method, used in the example above, which loads the document directly from a System.String. Before loading the document we can adjust the parsing behavior by setting the fields whose names follow the pattern OptionXXX.
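As a rough sketch of the Load variant and the option fields, the snippet below parses markup supplied through a TextReader. The class name LoadExample, the helper LoadFromReader and the sample markup are made up for illustration, and the two option fields were picked only as examples; which options are worth setting depends on the input.

[csharp]
using System;
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper class, not part of the library.
public static class LoadExample
{
    // Parses markup supplied through a TextReader and returns the document.
    public static HtmlDocument LoadFromReader(TextReader reader)
    {
        HtmlDocument doc = new HtmlDocument();

        // Option fields are set before loading. These two are only an
        // illustration of the OptionXXX naming pattern.
        doc.OptionFixNestedTags = true;
        doc.OptionAutoCloseOnEnd = true;

        doc.Load(reader);
        return doc;
    }

    public static void Main()
    {
        // Deliberately malformed sample markup: the <b> tag is never closed.
        using (StringReader reader = new StringReader("<p>Hello <b>world</p>"))
        {
            HtmlDocument doc = LoadFromReader(reader);
            Console.WriteLine(doc.DocumentNode.InnerText);
        }
    }
}
[/csharp]

The same helper works for a file or a network stream wrapped in a StreamReader, since both are TextReader instances.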

To retrieve the parse errors we use the ParseErrors property:

[csharp]
// Print the reason for every error the parser recorded.
Console.WriteLine("Parse errors:");
foreach (HtmlParseError error in doc.ParseErrors)
{
    Console.WriteLine(error.Reason);
}
[/csharp]

The root node of the document is available through the DocumentNode property; starting from this object we can walk through the nodes of the loaded document. To fetch an element by its identifier we use the GetElementbyId method.

[csharp]
// GetElementbyId returns null when no element has the given id.
HtmlNode blogDescription = doc.GetElementbyId("blog-description");
if (blogDescription != null)
{
    Console.WriteLine("Blog description: {0}", blogDescription.InnerText);
}
[/csharp]

If we want to find specific nodes in the document, we can use LINQ:

[csharp]
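// Requires: using System.Collections.Generic; using System.Linq;
// Query syntax: every <a> node that has an href attribute.
IEnumerable<HtmlNode> links = from link in doc.DocumentNode.DescendantNodes()
                              where link.Name == "a" && link.Attributes["href"] != null
                              select link;

// The same query written with method syntax.
IEnumerable<HtmlNode> links2 = doc.DocumentNode.DescendantNodes()
    .Where(x => x.Name == "a" && x.Attributes["href"] != null);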
[/csharp]

or XPath:

[csharp]
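HtmlNodeCollection xpathLinks =
    doc.DocumentNode.SelectNodes("//a[@href]");

Console.WriteLine("Links:");
// SelectNodes can return null when nothing matches the expression.
if (xpathLinks != null)
{
    foreach (var link in xpathLinks)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}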
[/csharp]
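The XPath lookup can also be wrapped in a small reusable helper. The sketch below is one possible shape, not the author's code: the class LinkExtractor, the method ExtractLinks and the sample markup are hypothetical names chosen for illustration. It assumes the library's default settings, under which SelectNodes yields null rather than an empty collection when nothing matches.

[csharp]
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, not part of the library.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> result = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");

        // Assumed default behavior: null instead of an empty collection
        // when no node matches the expression.
        if (anchors == null)
        {
            return result;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the fallback for a missing attribute.
            result.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return result;
    }

    public static void Main()
    {
        // Sample markup with unclosed <li> tags and one anchor without href.
        string html = "<ul><li><a href='/a'>A</a><li><a>no href</a><li><a href='/b'>B</a></ul>";
        foreach (string href in ExtractLinks(html))
        {
            Console.WriteLine(href);
        }
    }
}
[/csharp]

Returning an empty list instead of null keeps the calling code free of null checks.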

The latest version of the library can be found at http://htmlagilitypack.codeplex.com/

A project demonstrating the use of HtmlAgilityPack can be downloaded here.
