You have probably wondered more than once how to extract information from an HTML document quickly and without much effort. Unclosed tags and missing quotation marks around attribute values are the norm on most pages. XHTML solves some of these problems, but fully valid pages are still rare. Despite the similarities, most HTML pages cannot be treated as XML documents. The .NET platform does not provide tools for parsing HTML documents, so we have to rely on external libraries.
In this post I would like to present a library I came across some time ago and that has made my work easier more than once: Html Agility Pack. Its main advantage is that it lets you navigate an HTML document as if it were an XML document. To select elements you can use XPath or LINQ (since version 1.4.0). Using the library is fairly simple, and I will briefly demonstrate it below.
The first step, of course, is to add a reference to HTMLAgilityPack.dll to the project. Once that is done, we can use everything the library provides.
The class that represents an HTML document is HtmlDocument. We create an instance of it using the default constructor.
[csharp]
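// Requires: using System.Net; using HtmlAgilityPack;
// Download the page source as a string.
WebClient client = new WebClient();
string html = client.DownloadString("http://blog.pietowski.com");
// Parse the downloaded markup into a document tree.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);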
[/csharp]
An HTML document can be loaded with the Load method, which accepts a stream, a TextReader, or a file path. The alternative is the LoadHtml method, used in the example above, which parses the document directly from a System.String. Before loading the document, you can adjust the parser's behavior by setting the fields whose names follow the pattern OptionXXX.
To retrieve parse errors, use the ParseErrors property:
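As a rough sketch of the option fields and the Load overload mentioned above, the snippet below sets two options before parsing and reads the markup through a TextReader. The sample markup and the class name are made up for illustration, and the chosen option values are assumptions, not recommended settings.

[csharp]
using System;
using System.IO;
using HtmlAgilityPack;

class LoadOptionsSketch
{
    static void Main()
    {
        // Hypothetical sample markup with unclosed <p> tags.
        string html = "<html><body><p>First<p>Second</body></html>";

        HtmlDocument doc = new HtmlDocument();
        // Options are public fields; set them before loading the document.
        doc.OptionFixNestedTags = true;  // ask the parser to repair badly nested tags
        doc.OptionAutoCloseOnEnd = true; // close unclosed nodes at the end of the document

        // Load accepts a TextReader; StringReader stands in for a file or stream here.
        using (TextReader reader = new StringReader(html))
        {
            doc.Load(reader);
        }

        // Print the document as the parser sees it.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
[/csharp]

Comparing the printed markup with and without these options is a quick way to see what each one changes for a given input.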
[csharp]
Console.WriteLine("Parse errors:");
foreach (HtmlParseError error in doc.ParseErrors)
{
    Console.WriteLine(error.Reason);
}
[/csharp]
The root node of the document is available through the DocumentNode property; starting from this object we can walk the successive nodes of the loaded document. To fetch an element by its identifier, use the GetElementbyId method.
[csharp]
HtmlNode blogDescription = doc.GetElementbyId("blog-description");
if (blogDescription != null)
{
    Console.WriteLine("Blog description: {0}", blogDescription.InnerText);
}
[/csharp]
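Walking the tree from DocumentNode can be sketched as a small recursive method. This is only an illustration: the PrintElements helper, the class name, and the sample markup are hypothetical, and the sketch assumes we only care about element nodes.

[csharp]
using System;
using HtmlAgilityPack;

class WalkTreeSketch
{
    // Print each element's name, indented by its depth in the tree.
    static void PrintElements(HtmlNode node, int depth)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            // Skip text and comment nodes; only elements are printed.
            if (child.NodeType != HtmlNodeType.Element)
                continue;

            Console.WriteLine(new string(' ', depth * 2) + child.Name);
            PrintElements(child, depth + 1);
        }
    }

    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        // Hypothetical sample markup.
        doc.LoadHtml("<html><head><title>Demo</title></head>" +
                     "<body><div id=\"main\"><p>Hi</p></div></body></html>");

        // DocumentNode is the root; its children are the top-level nodes.
        PrintElements(doc.DocumentNode, 0);
    }
}
[/csharp]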
To find specific nodes in the document we can use LINQ:
[csharp]
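// Requires: using System.Linq; using System.Collections.Generic;
// Query syntax: every <a> element that has an href attribute.
IEnumerable<HtmlNode> links = from link in doc.DocumentNode.DescendantNodes()
                              where link.Name == "a" && link.Attributes["href"] != null
                              select link;
// The same query in method syntax.
IEnumerable<HtmlNode> links2 = doc.DocumentNode.DescendantNodes()
    .Where(x => x.Name == "a" && x.Attributes["href"] != null);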
[/csharp]
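A slightly shorter variant of the same idea, shown here only as a sketch: Descendants("a") filters by element name and GetAttributeValue supplies a default when the attribute is missing, so no explicit null check is needed. The sample markup and class name are made up for illustration.

[csharp]
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class LinkListSketch
{
    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        // Hypothetical sample: two links and one anchor without href.
        doc.LoadHtml("<p><a href=\"/one\">One</a> <a name=\"x\">anchor</a> " +
                     "<a href=\"/two\">Two</a></p>");

        // GetAttributeValue returns the supplied default ("") when href is absent,
        // so anchors without href are dropped by the Where clause.
        List<string> hrefs = doc.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href != "")
            .ToList();

        foreach (string href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
[/csharp]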
or XPath:
[csharp]
// SelectNodes returns null when no node matches the expression.
HtmlNodeCollection xpathLinks =
    doc.DocumentNode.SelectNodes("//a[@href]");
Console.WriteLine("Links:");
if (xpathLinks != null)
{
    foreach (var link in xpathLinks)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}
[/csharp]
The latest version of the library can be found at http://htmlagilitypack.codeplex.com/
A project demonstrating the use of HtmlAgilityPack can be downloaded here.