Parsing an HTML document in .NET

You have probably wondered more than once how to extract information from an HTML document quickly and without much effort. An unclosed tag or missing quotation marks around attribute values are standard on most pages. XHTML partly solves some of these problems, but fully validated pages are rare as well. Despite the similarities, most HTML pages cannot be treated as XML documents. The .NET platform does not provide tools for parsing HTML documents, so we are left with external libraries.
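To illustrate why typical HTML cannot simply be handed to an XML parser, here is a minimal sketch that uses only System.Xml. The class name, the helper method, and the sample markup are made up for this illustration.

[csharp]
using System;
using System.Xml;

class XmlVersusHtml
{
    // Returns true when the markup can be parsed as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Hypothetical fragment: unquoted attribute value and an unclosed <br>.
        string html = "<div class=box>Line one<br>Line two</div>";
        Console.WriteLine(IsWellFormedXml(html));
    }
}
[/csharp]

Markup like this is perfectly ordinary on the web, yet a strict XML parser rejects it, which is the gap a dedicated HTML parser is meant to fill.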

In this post I would like to present a library that I came across some time ago and that has made my work easier more than once: Html Agility Pack. Its main advantage is that it lets us navigate an HTML document as if it were an XML document. To select elements of the document we can use XPath or LINQ (since version 1.4.0). Using the library is fairly simple, and I will briefly demonstrate it below.

Of course, the first thing we have to do in order to use the library is add a reference to the HtmlAgilityPack.dll file to the project. After that we can use the types the library provides.

The class that represents our HTML document is HtmlDocument. We create an object of this class using the default constructor.

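[csharp]
// Requires: using System.Net; using HtmlAgilityPack;
// Download the page source as a string.
WebClient client = new WebClient();
string html = client.DownloadString("http://blog.pietowski.com");

// Create an empty document and parse the downloaded markup.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
[/csharp]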

We can load an HTML document with the Load method, which accepts a stream, a TextReader object, or a file path. The alternative is the LoadHtml method, used in the example above, which loads the document directly from a System.String object. Before loading the document we can configure parsing by setting the fields whose names follow the pattern OptionXXX.
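The sketch below shows one way the options and the Load method might be combined. The class and method names and the sample markup are made up for this illustration, and the two options shown (OptionFixNestedTags and OptionAutoCloseOnEnd) are only examples of the available settings, not a recommended configuration.

[csharp]
using System;
using System.IO;
using HtmlAgilityPack;

class LoadOptionsExample
{
    // Hypothetical helper: parses markup with a couple of lenient options set.
    public static HtmlDocument LoadLenient(string html)
    {
        HtmlDocument doc = new HtmlDocument();

        // Options are set before the document is loaded.
        doc.OptionFixNestedTags = true;
        doc.OptionAutoCloseOnEnd = true;

        // Load accepts a TextReader; a StringReader stands in
        // for a file or network stream here.
        using (TextReader reader = new StringReader(html))
        {
            doc.Load(reader);
        }
        return doc;
    }

    static void Main()
    {
        // Made-up sample markup with unclosed <p> elements.
        HtmlDocument doc = LoadLenient("<html><body><p>First<p>Second</body></html>");
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
[/csharp]

Passing a file path or a Stream to Load works the same way; only the argument changes.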

To retrieve parsing errors, we use the ParseErrors property:

[csharp]
Console.WriteLine("Parse errors:");
foreach (HtmlParseError error in doc.ParseErrors)
{
    Console.WriteLine(error.Reason);
}
[/csharp]

The root node of the document is available through the DocumentNode property; using this object we can walk through the nodes of the loaded document. To retrieve an element by its identifier we use the GetElementbyId method.

[csharp]
HtmlNode blogDescription = doc.GetElementbyId("blog-description");
if (blogDescription != null)
{
    Console.WriteLine("Blog description: {0}", blogDescription.InnerText);
}
[/csharp]

If we want to find specific nodes in our document, we can use LINQ:

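[csharp]
// Requires: using System.Linq; using System.Collections.Generic;
// Query syntax: all <a> elements that have an href attribute.
IEnumerable<HtmlNode> links = from link in doc.DocumentNode.DescendantNodes()
                              where link.Name == "a" && link.Attributes["href"] != null
                              select link;

// The same query written with method syntax.
IEnumerable<HtmlNode> links2 = doc.DocumentNode.DescendantNodes()
    .Where(x => x.Name == "a" && x.Attributes["href"] != null);
[/csharp]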

or XPath:

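[csharp]
HtmlNodeCollection xpathLinks =
    doc.DocumentNode.SelectNodes("//a[@href]");

Console.WriteLine("Links:");
// SelectNodes returns null when nothing matches, so guard before iterating.
if (xpathLinks != null)
{
    foreach (var link in xpathLinks)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}
[/csharp]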

The latest version of the library can be found at http://htmlagilitypack.codeplex.com/

A project demonstrating the use of HtmlAgilityPack can be downloaded here.
