{
"cells": [
{
"cell_type": "code",
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"polyglot_notebook": {
"kernelName": "fsharp"
}
},
"execution_count": null, "outputs": [],
"source": [
"#r \"nuget: FSharp.Data,6.6.0\"\n",
"\n",
"Formatter.SetPreferredMimeTypesFor(typeof\u003cobj\u003e, \"text/plain\")\n",
"Formatter.Register(fun (x: obj) (writer: TextWriter) -\u003e fprintfn writer \"%120A\" x)\n"
]
}
,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Binder](https://mybinder.org/v2/gh/fsprojects/FSharp.Data/gh-pages?filepath=library/HtmlParser.ipynb)\u0026emsp;\n",
"[Script](https://fsprojects.github.io/FSharp.Data//library/HtmlParser.fsx)\u0026emsp;\n",
"[Notebook](https://fsprojects.github.io/FSharp.Data//library/HtmlParser.ipynb)\n",
"\n",
"# HTML Parser\n",
"\n",
"This article demonstrates how to use the HTML parser to parse HTML files.\n",
"\n",
"The HTML parser takes an HTML fragment, a URI, or a stream and tries to parse it into a DOM.\n",
"The parser is based on the [HTML Living Standard](http://www.whatwg.org/specs/web-apps/current-work/multipage/index.html#contents).\n",
"Once a document or fragment has been parsed, a set of extension methods over the HTML DOM elements allows you to extract information from a web page\n",
"independently of the HTML type provider.\n",
"\n"
]
}
,
{
"cell_type": "code",
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"polyglot_notebook": {
"kernelName": "fsharp"
}
},
"execution_count": 2, "outputs": [],
"source": [
"open FSharp.Data\n"
]
}
,
{
"cell_type": "markdown",
"metadata": {},
"source": [
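"The input kinds described in the introduction can be tried without network access. The following is a minimal sketch, assuming the string-based `HtmlDocument.Parse` and the stream-based `HtmlDocument.Load` overload of FSharp.Data; the markup and the names `sampleHtml`, `parseFromStream`, and `paragraphsOf` are made up for illustration.\n",
"\n",
"```fsharp\n",
"#r \"nuget: FSharp.Data,6.6.0\"\n",
"open System.IO\n",
"open System.Text\n",
"open FSharp.Data\n",
"\n",
"// Hypothetical inline markup, used only for illustration.\n",
"let sampleHtml =\n",
"    \"\u003chtml\u003e\u003cbody\u003e\u003cp\u003eHello\u003c/p\u003e\u003c/body\u003e\u003c/html\u003e\"\n",
"\n",
"// Parse directly from a string.\n",
"let fromText = HtmlDocument.Parse(sampleHtml)\n",
"\n",
"// Parse the same markup from a stream (assumed to be UTF-8 encoded here).\n",
"let parseFromStream (html: string) =\n",
"    use stream = new MemoryStream(Encoding.UTF8.GetBytes(html))\n",
"    HtmlDocument.Load(stream)\n",
"\n",
"let fromStream = parseFromStream sampleHtml\n",
"\n",
"// Collect the text of every paragraph; both documents should give the same list.\n",
"let paragraphsOf (doc: HtmlDocument) =\n",
"    doc.Descendants [ \"p\" ] |\u003e Seq.map (fun p -\u003e p.InnerText()) |\u003e Seq.toList\n",
"\n",
"let fromTextParagraphs = paragraphsOf fromText\n",
"let fromStreamParagraphs = paragraphsOf fromStream\n",
"```\n",
"\n",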
"The following example uses Google to search for `FSharp.Data` and then parses the first set of\n",
"search results from the page, extracting the URL and title of each link.\n",
"We use the [HtmlDocument](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocument.html) type.\n",
"\n",
"To achieve this, we must first parse the web page into our DOM. We can do this using\n",
"the [HtmlDocument.Load](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocument.html) method. This method takes a URL and makes a synchronous web request\n",
"to fetch the page. Note: an asynchronous variant, [HtmlDocument.AsyncLoad](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocument.html), is also available.\n",
"\n"
]
}
,
{
"cell_type": "code",
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"polyglot_notebook": {
"kernelName": "fsharp"
}
},
"execution_count": 3, "outputs": [
{
"data": {
"text/plain": ["val results: HtmlDocument =",
"",
" \u003c!DOCTYPE html\u003e",
"",
"\u003chtml lang=\"en\"\u003e",
"",
" \u003chead\u003e",
"",
" \u003ctitle\u003eGoogle Search\u003c/title\u003e\u003cstyle\u003ebody{background-color:#fff}\u003c/style\u003e\u003cscript nonce=\"tZRABT-PsFgnaoFDQnZkDQ\"\u003ewindow.google = window.google || {};window.google.c = window.google.c || {ezx:false,cap:0};\u003c/script\u003e",
"",
" \u003c/head\u003e",
"",
" \u003cbody\u003e",
"",
" \u003cnoscript\u003e",
"",
" \u003cstyle\u003etable,div,span,p{display:none}\u003c/style\u003e\u003cmeta content=\"0;url=/httpservice/retry/enablejs?sei=2rO1aOSnJ6inqtsPupO0kAk\" http-equiv=\"refresh\" /\u003e",
"",
" \u003cdiv style=\"display:block\"\u003e",
"",
" Please click \u003ca..."]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}],
"source": [
"let results = HtmlDocument.Load(\"http://www.google.co.uk/search?q=FSharp.Data\")\n"
]
}
,
{
"cell_type": "markdown",
"metadata": {},
"source": [
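"The asynchronous variant mentioned earlier fits into an `async` workflow. The following is a minimal sketch, assuming `HtmlDocument.AsyncLoad` accepts a URL string in the same way as `Load`; `countAnchorsAsync` is a hypothetical helper name, and running it requires network access.\n",
"\n",
"```fsharp\n",
"#r \"nuget: FSharp.Data,6.6.0\"\n",
"open FSharp.Data\n",
"\n",
"// Hypothetical helper: load a page without blocking the caller and count its anchor tags.\n",
"let countAnchorsAsync (url: string) : Async\u003cint\u003e =\n",
"    async {\n",
"        let! doc = HtmlDocument.AsyncLoad(url)\n",
"        return doc.Descendants [ \"a\" ] |\u003e Seq.length\n",
"    }\n",
"\n",
"// Usage (requires network access):\n",
"// countAnchorsAsync \"http://www.google.co.uk/search?q=FSharp.Data\" |\u003e Async.RunSynchronously\n",
"```\n",
"\n",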
"Now that we have a loaded HTML document, we can begin to extract data from it.\n",
"First, we extract all of the anchor tags (`a`) from the document using [HtmlDocumentExtensions.Descendants](https://fsprojects.github.io/FSharp.Data/reference/fsharp-data-htmldocumentextensions.html#Descendants), then\n",
"inspect each link to see whether it has an `href` attribute. If it does, we extract its value,\n",
"which in this case is the URL that the search result points to, along with the\n",
"`InnerText` of the anchor tag, which gives the name of the web page for that\n",
"search result.\n",
"\n"
]
}
,
{
"cell_type": "code",
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"polyglot_notebook": {
"kernelName": "fsharp"
}
},
"execution_count": 4, "outputs": [
{
"data": {
"text/plain": ["val links: (string * string) list =",
"",
" [(\"here\", \"/httpservice/retry/enablejs?sei=2rO1aOSnJ6inqtsPupO0kAk\");",
"",
" (\"click here\",",
"",
" \"/search?q=FSharp.Data\u0026sca_esv=6bae24b5c791315b\u0026ie=UTF-8\u0026emsg=\"+[34 chars]);",
"",
" (\"feedback\", \"https://support.google.com/websearch\")]"]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}],
"source": [
"let links =\n",
" results.Descendants [ \"a\" ]\n",
" |\u003e Seq.choose (fun x -\u003e x.TryGetAttribute(\"href\") |\u003e Option.map (fun a -\u003e x.InnerText(), a.Value()))\n",
" |\u003e Seq.truncate 10\n",
" |\u003e Seq.toList\n"
]
}
,
{
"cell_type": "markdown",
"metadata": {},
"source": [
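"Because the live page can change between runs, the same extraction can be checked offline. The following is a minimal sketch, assuming `HtmlDocument.Parse` together with the `Descendants`, `TryGetAttribute`, and `InnerText` extensions used above; the markup and the names `sample` and `sampleLinks` are hypothetical.\n",
"\n",
"```fsharp\n",
"#r \"nuget: FSharp.Data,6.6.0\"\n",
"open FSharp.Data\n",
"\n",
"// Hypothetical fragment: two anchors with an href and one without.\n",
"let sample =\n",
"    HtmlDocument.Parse(\n",
"        \"\"\"\u003chtml\u003e\u003cbody\u003e\u003ca href=\"/one\"\u003eFirst\u003c/a\u003e\u003ca\u003eNo link\u003c/a\u003e\u003ca href=\"/two\"\u003eSecond\u003c/a\u003e\u003c/body\u003e\u003c/html\u003e\"\"\"\n",
"    )\n",
"\n",
"// Keep only anchors that carry an href, pairing the link text with its target.\n",
"let sampleLinks =\n",
"    sample.Descendants [ \"a\" ]\n",
"    |\u003e Seq.choose (fun a -\u003e\n",
"        a.TryGetAttribute(\"href\")\n",
"        |\u003e Option.map (fun href -\u003e a.InnerText(), href.Value()))\n",
"    |\u003e Seq.toList\n",
"```\n",
"\n",
"The anchor without an `href` is dropped by `Seq.choose`, because `TryGetAttribute` returns `None` for it.\n",
"\n",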
"Now that we have extracted the links, you will notice that, besides the search results, there can be many\n",
"other links to various Google services and cached/similar results. Ideally, we would\n",
"like to filter these out, as we are probably not interested in them.\n",
"At this point we simply have a list of tuples, so F# makes this trivial using `List.filter`\n",
"and `List.map`.\n",
"In the run captured in this notebook, Google returned a page asking for JavaScript to be enabled, so no link starts with `/url?` and the filtered list is empty.\n",
"\n"
]
}
,
{
"cell_type": "code",
"metadata": {
"dotnet_interactive": {
"language": "fsharp"
},
"polyglot_notebook": {
"kernelName": "fsharp"
}
},
"execution_count": 5, "outputs": [
{
"data": {
"text/plain": ["val searchResults: (string * string) list = []"]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}],
"source": [
"let searchResults =\n",
" links\n",
" |\u003e List.filter (fun (name, url) -\u003e name \u003c\u003e \"Cached\" \u0026\u0026 name \u003c\u003e \"Similar\" \u0026\u0026 url.StartsWith(\"/url?\"))\n",
" |\u003e List.map (fun (name, url) -\u003e name, url.Substring(0, url.IndexOf(\"\u0026sa=\")).Replace(\"/url?q=\", \"\"))\n"
]
}
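,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filtering step can also be exercised on hand-written tuples. The following is a minimal sketch: `rawLinks` holds hypothetical data in the `/url?q=...\u0026sa=...` shape that the code above expects, and, unlike the cell above, it guards against a missing `\u0026sa=` marker so that `Substring` is never called with a negative length.\n",
"\n",
"```fsharp\n",
"// Hypothetical (name, url) pairs imitating links scraped from a results page.\n",
"let rawLinks =\n",
"    [ \"FSharp.Data\", \"/url?q=https://fsprojects.github.io/FSharp.Data/\u0026sa=U\u0026ved=abc\"\n",
"      \"Cached\", \"/url?q=https://example.com/cache\u0026sa=U\"\n",
"      \"Images\", \"/imghp?hl=en\"\n",
"      \"Docs\", \"/url?q=https://example.com/docs\" ]\n",
"\n",
"let cleaned =\n",
"    rawLinks\n",
"    // Drop cached/similar entries and anything that is not a result redirect.\n",
"    |\u003e List.filter (fun (name, url) -\u003e\n",
"        name \u003c\u003e \"Cached\" \u0026\u0026 name \u003c\u003e \"Similar\" \u0026\u0026 url.StartsWith(\"/url?\"))\n",
"    // Cut the tracking suffix (when present) and the redirect prefix.\n",
"    |\u003e List.map (fun (name, url) -\u003e\n",
"        let stop = url.IndexOf(\"\u0026sa=\")\n",
"        let trimmed = if stop \u003e= 0 then url.Substring(0, stop) else url\n",
"        name, trimmed.Replace(\"/url?q=\", \"\"))\n",
"```\n",
"\n",
"With this input, `cleaned` keeps only the `FSharp.Data` and `Docs` entries, each paired with its bare target URL.\n"
]
}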
],
"metadata": {
"kernelspec": {
"display_name": ".NET (F#)",
"language": "F#",
"name": ".net-fsharp"
},
"language_info": {
"file_extension": ".fs",
"mimetype": "text/x-fsharp",
"name": "polyglot-notebook",
"pygments_lexer": "fsharp"
},
"polyglot_notebook": {
"kernelInfo": {
"defaultKernelName": "fsharp",
"items": [
{
"aliases": [],
"languageName": "fsharp",
"name": "fsharp"
}
]
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}