{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": null, "outputs": [], "source": [ "#r \"nuget: FSharp.Data,6.6.0\"\n", "\n", "Formatter.SetPreferredMimeTypesFor(typeof\u003cobj\u003e, \"text/plain\")\n", "Formatter.Register(fun (x: obj) (writer: TextWriter) -\u003e fprintfn writer \"%120A\" x)\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](../img/badge-binder.svg)](https://mybinder.org/v2/gh/fsprojects/FSharp.Data/gh-pages?filepath=library/HtmlParser.ipynb)\u0026emsp;\n", "[![Script](../img/badge-script.svg)](https://fsprojects.github.io/FSharp.Data//library/HtmlParser.fsx)\u0026emsp;\n", "[![Notebook](../img/badge-notebook.svg)](https://fsprojects.github.io/FSharp.Data//library/HtmlParser.ipynb)\n", "\n", "# HTML Parser\n", "\n", "This article demonstrates how to use the HTML Parser to parse HTML files.\n", "\n", "The HTML parser takes any fragment of HTML, a URI, or a stream and tries to parse it into a DOM.\n", "The parser is based on the [HTML Living Standard](http://www.whatwg.org/specs/web-apps/current-work/multipage/index.html#contents).\n", "Once a document/fragment has been parsed, a set of extension methods over the HTML DOM elements allows you to extract information from a web page\n", "independently of the actual HTML Type Provider.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 2, "outputs": [], "source": [ "open FSharp.Data\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The following example uses Google to search for `FSharp.Data` and then parses the first set of\n", "search results from the page, extracting the URL and title of each link.\n", "We use the 
[HtmlDocument](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocument.html) type.\n", "\n", "To achieve this, we must first parse the web page into our DOM. We can do this using\n", "the [HtmlDocument.Load](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocument.html) method. This method takes a URL and makes a synchronous web call\n", "to extract the data from the page. Note: an asynchronous variant, [HtmlDocument.AsyncLoad](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocument.html), is also available.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 3, "outputs": [ { "data": { "text/plain": ["val results: HtmlDocument =", "", " \u003c!DOCTYPE html\u003e", "", "\u003chtml lang=\"en\"\u003e", "", " \u003chead\u003e", "", " \u003ctitle\u003eGoogle Search\u003c/title\u003e\u003cstyle\u003ebody{background-color:#fff}\u003c/style\u003e\u003cscript nonce=\"tZRABT-PsFgnaoFDQnZkDQ\"\u003ewindow.google = window.google || {};window.google.c = window.google.c || {ezx:false,cap:0};\u003c/script\u003e", "", " \u003c/head\u003e", "", " \u003cbody\u003e", "", " \u003cnoscript\u003e", "", " \u003cstyle\u003etable,div,span,p{display:none}\u003c/style\u003e\u003cmeta content=\"0;url=/httpservice/retry/enablejs?sei=2rO1aOSnJ6inqtsPupO0kAk\" http-equiv=\"refresh\" /\u003e", "", " \u003cdiv style=\"display:block\"\u003e", "", " Please click \u003ca..."] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }], "source": [ "let results = HtmlDocument.Load(\"http://www.google.co.uk/search?q=FSharp.Data\")\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a loaded HTML document, we can begin to extract data from it.\n", "Firstly, we want to extract all of the anchor tags `a` out of the document, then\n", "inspect each link to see whether it has an `href` attribute, 
using [HtmlDocumentExtensions.Descendants](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocumentextensions.html#Descendants). If it does, extract the value,\n", "which in this case is the URL that the search result is pointing to, and additionally the\n", "`InnerText` of the anchor tag to provide the name of the web page for the search result\n", "we are looking at.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 4, "outputs": [ { "data": { "text/plain": ["val links: (string * string) list =", "", " [(\"here\", \"/httpservice/retry/enablejs?sei=2rO1aOSnJ6inqtsPupO0kAk\");", "", " (\"click here\",", "", " \"/search?q=FSharp.Data\u0026sca_esv=6bae24b5c791315b\u0026ie=UTF-8\u0026emsg=\"+[34 chars]);", "", " (\"feedback\", \"https://support.google.com/websearch\")]"] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }], "source": [ "let links =\n", "    results.Descendants [ \"a\" ]\n", "    |\u003e Seq.choose (fun x -\u003e x.TryGetAttribute(\"href\") |\u003e Option.map (fun a -\u003e x.InnerText(), a.Value()))\n", "    |\u003e Seq.truncate 10\n", "    |\u003e Seq.toList\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have extracted our search results, you will notice that there are lots of\n", "other links to various Google services and cached/similar results. 
Ideally, we would\n", "like to filter these out, as we are probably not interested in them.\n", "At this point we simply have a list of tuples, so F# makes this trivial using `List.filter`\n", "and `List.map`.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 5, "outputs": [ { "data": { "text/plain": ["val searchResults: (string * string) list = []"] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }], "source": [ "let searchResults =\n", "    links\n", "    |\u003e List.filter (fun (name, url) -\u003e name \u003c\u003e \"Cached\" \u0026\u0026 name \u003c\u003e \"Similar\" \u0026\u0026 url.StartsWith(\"/url?\"))\n", "    |\u003e List.map (fun (name, url) -\u003e name, url.Substring(0, url.IndexOf(\"\u0026sa=\")).Replace(\"/url?q=\", \"\"))\n" ] } ], "metadata": { "kernelspec": { "display_name": ".NET (F#)", "language": "F#", "name": ".net-fsharp" }, "language_info": { "file_extension": ".fs", "mimetype": "text/x-fsharp", "name": "polyglot-notebook", "pygments_lexer": "fsharp" }, "polyglot_notebook": { "kernelInfo": { "defaultKernelName": "fsharp", "items": [ { "aliases": [], "languageName": "fsharp", "name": "fsharp" } ] } } }, "nbformat": 4, "nbformat_minor": 2 }