{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": null, "outputs": [], "source": [ "#r \"nuget: FSharp.Data,6.6.0\"\n", "\n", "Formatter.SetPreferredMimeTypesFor(typeof\u003cobj\u003e, \"text/plain\")\n", "Formatter.Register(fun (x: obj) (writer: TextWriter) -\u003e fprintfn writer \"%120A\" x)\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](../img/badge-binder.svg)](https://mybinder.org/v2/gh/fsprojects/FSharp.Data/gh-pages?filepath=library/HtmlProvider.ipynb)\u0026emsp;\n", "[![Script](../img/badge-script.svg)](https://fsprojects.github.io/FSharp.Data//library/HtmlProvider.fsx)\u0026emsp;\n", "[![Notebook](../img/badge-notebook.svg)](https://fsprojects.github.io/FSharp.Data//library/HtmlProvider.ipynb)\n", "\n", "# HTML Type Provider\n", "\n", "This article demonstrates how to use the HTML type provider to read HTML tables\n", "in a statically typed way.\n", "\n", "The HTML Type Provider takes a sample HTML document as input and generates a type based on the data\n", "present in the columns of that sample. The column names are obtained from the first (header) row.\n", "\n", "## Introducing the provider\n", "\n", "The type provider is located in the `FSharp.Data.dll` assembly. 
Assuming the assembly\n", "has been referenced (here, through the `#r` NuGet directive in the first code cell), we can open the `FSharp.Data` namespace in F# Interactive as follows:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 2, "outputs": [], "source": [ "open FSharp.Data\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "### Parsing F1 Calendar Data\n", "\n", "This example uses the HTML Type Provider to extract each row from a table on a Wikipedia page.\n", "\n", "In HTML files, headers are usually marked with the `\u003cth\u003e` tag. However, this is not true in general, so the provider assumes that the\n", "first row contains the headers. (This behaviour is likely to get smarter in later releases.) This highlights a general problem with HTML\u0027s lack of strictness.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 3, "outputs": [], "source": [ "[\u003cLiteral\u003e]\n", "let F1_2017_URL =\n", "    \"https://en.wikipedia.org/wiki/2017_FIA_Formula_One_World_Championship\"\n", "\n", "type F1_2017 = HtmlProvider\u003cF1_2017_URL\u003e\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The generated type provides a type space of tables that it has managed to parse out of the given HTML document.\n", "Each type\u0027s name is derived from the id, title, name, summary or caption attributes/tags provided. If none of these\n", "exists, the table is simply named `Tablexx`, where xx is the table\u0027s position in the HTML document when all of the tables are flattened into a list.\n", "The `Load` method allows reading the data from a file or web resource. 
Here the sample parameter of the type provider is a web URL, but a local file could have been used instead.\n", "The following sample calls the `Load` method with a URL that points to a live version of the same page on Wikipedia.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 4, "outputs": [ { "data": { "text/plain": ["Race, round \"1\" is hosted at \"Australian Grand Prix\" on \"26 March\"", "", "Race, round \"2\" is hosted at \"Chinese Grand Prix\" on \"9 April\"", "", "Race, round \"3\" is hosted at \"Bahrain Grand Prix\" on \"16 April\"", "", "Race, round \"4\" is hosted at \"Russian Grand Prix\" on \"30 April\"", "", "Race, round \"5\" is hosted at \"Spanish Grand Prix\" on \"14 May\"", "", "Race, round \"6\" is hosted at \"Monaco Grand Prix\" on \"28 May\"", "", "Race, round \"7\" is hosted at \"Canadian Grand Prix\" on \"11 June\"", "", "Race, round \"8\" is hosted at \"Azerbaijan Grand Prix\" on \"25 June\"", "", "Race, round \"9\" is hosted at \"Austrian Grand Prix\" on \"9 July\"", "", "Race, round \"10\" is hosted at \"British Grand Prix\" on \"16 July\"", "", "Race, round \"11\" is hosted at \"Hungarian Grand Prix\" on \"30 July\"", "", "Race, round \"12\" is hosted at \"Belgian Grand Prix\" on \"27 August\"", "", "Race, round \"13\" is hosted at \"Italian Grand Prix\" on \"3 September\"", "", "Race, round \"14\" is hosted at \"Singapore Grand Prix\" on \"17 September\"", "", "Race, round \"15\" is hosted at \"Malaysian Grand Prix\" on \"1 October\"", "", "Race, round \"16\" is hosted at \"Japanese Grand Prix\" on \"8 October\"", "", "Race, round \"17\" is hosted at \"United States Grand Prix\" on \"22 October\"", "", "Race, round \"18\" is hosted at \"Mexican Grand Prix\" on \"29 October\"", "", "Race, round \"19\" is hosted at \"Brazilian Grand Prix\" on \"12 November\"", "", "Race, round \"20\" is hosted at \"Abu Dhabi Grand Prix\" 
on \"26 November\"", "", "Race, round \"Source:[63]\" is hosted at \"Source:[63]\" on \"Source:[63]\"", "", "val f1Calendar: HtmlProvider\u003c...\u003e.Calendar", "", "val firstRow: HtmlProvider\u003c...\u003e.Calendar.Row =", "", " (\"1\", \"Australian Grand Prix\", \"Albert Park Circuit, Melbourne\", \"26 March\")", "", "val round: string = \"1\"", "", "val grandPrix: string = \"Australian Grand Prix\"", "", "val date: string = \"26 March\"", "", "val it: unit = ()"] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }], "source": [ "// Download the table for the 2017 F1 calendar from Wikipedia\n", "let f1Calendar = F1_2017.Load(F1_2017_URL).Tables.Calendar\n", "\n", "// Look at the top row, being the first race of the calendar\n", "let firstRow = f1Calendar.Rows |\u003e Seq.head\n", "let round = firstRow.Round\n", "let grandPrix = firstRow.``Grand Prix``\n", "let date = firstRow.Date\n", "\n", "// Print the round, location and date for each race, corresponding to a row\n", "for row in f1Calendar.Rows do\n", "    printfn \"Race, round %A is hosted at %A on %A\" row.Round row.``Grand Prix`` row.Date\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "The generated type has a property `Rows` that returns the data from the HTML file as a\n", "collection of rows. We iterate over the rows using a `for` loop. As you can see, the\n", "(generated) type for rows has properties such as `Grand Prix`, `Circuit`, `Round` and `Date` that correspond\n", "to the columns in the selected HTML table.\n", "\n", "The type provider also infers the types of individual columns. A column is given a more specific type, such as `DateTime`,\n", "only when all of its values in the sample can be parsed as that type; otherwise it falls back to `string`. In the output above,\n", "`Round` and `Date` are both inferred as `string` (note the trailing `Source:[63]` row, which cannot be parsed as a number or a date).\n", "\n", "### Parsing NuGet package stats\n", "\n", "This small sample shows how the HTML Type Provider can be used to scrape data from a website. 
In this example, we analyze the download counts of the FSharp.Data package on NuGet.\n", "Note that we\u0027re using the live URL as the sample, so we can just use the default constructor, as the runtime data will be the same as the compile-time data.\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 5, "outputs": [ { "data": { "text/plain": ["type NugetStats = HtmlProvider\u003c...\u003e", "", "val rawStats: HtmlProvider\u003c...\u003e.VersionHistoryOfFSharpData", "", "val getMinorVersion: v: string -\u003e string", "", "val stats: (string * decimal) array =", "", " [|(\"6.6\", 140747M); (\"6.5\", 3788M); (\"6.4\", 589138M); (\"6.3\", 397317M);", "", " (\"6.2\", 156895M); (\"6.1\", 3205M); (\"6.0\", 18902M); (\"5.0\", 494859M);", "", " (\"4.2\", 944741M); (\"4.1\", 207622M); (\"4.0\", 124078M); (\"3.3\", 1301429M);", "", " (\"3.2\", 69107M); (\"3.1\", 265699M); (\"3.0\", 661332M); (\"2.4\", 492190M);", "", " (\"2.3\", 659792M); (\"2.2\", 380499M); (\"2.1\", 46816M); (\"2.0\", 173216M);", "", " (\"1.1\", 128894M); (\"1.0\", 80907M)|]"] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }], "source": [ "// Configure the type provider\n", "type NugetStats = HtmlProvider\u003c\"https://www.nuget.org/packages/FSharp.Data\"\u003e\n", "\n", "// Load the live package stats for FSharp.Data\n", "let rawStats = NugetStats().Tables.``Version History of FSharp.Data``\n", "\n", "// Helper that extracts the major.minor part of a NuGet version string\n", "let getMinorVersion (v: string) =\n", "    System.Text.RegularExpressions.Regex(@\"\\d+\\.\\d+\").Match(v).Value\n", "\n", "// Group by major.minor version and sum the download counts\n", "let stats =\n", "    rawStats.Rows\n", "    |\u003e Seq.groupBy (fun r -\u003e getMinorVersion r.Version)\n", "    |\u003e Seq.map (fun (k, xs) -\u003e k, xs |\u003e Seq.sumBy (fun x -\u003e x.Downloads))\n", "    |\u003e Seq.toArray\n" ] 
} , { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting statistics on Doctor Who\n", "\n", "This sample shows some more screen scraping from Wikipedia:\n", "\n" ] } , { "cell_type": "code", "metadata": { "dotnet_interactive": { "language": "fsharp" }, "polyglot_notebook": { "kernelName": "fsharp" } }, "execution_count": 6, "outputs": [ { "data": { "text/plain": ["[\u003cLiteral\u003e]", "", "val DrWho: string", "", " =", "", " \"https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963%E2%80%931989)\"", "", "val doctorWho: HtmlProvider\u003c...\u003e", "", "val viewersByDoctor: (string * float) array =", "", " [|(\"Waris Hussein\", 8.0); (\"\", nan); (\"Christopher Barry\", 8.275);", "", " (\"Richard Martin\", 10.025); (\"Frank Cox\", 7.9); (\"John Crockett\", 8.0);", "", " (\"John Gorrie\", 9.066666667); (\"Mervyn Pinfield\", 6.925);", "", " (\"Henric Hirsch\", 6.733333333)|]"] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }], "source": [ "[\u003cLiteral\u003e]\n", "let DrWho =\n", "    \"https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963%E2%80%931989)\"\n", "\n", "let doctorWho = new HtmlProvider\u003cDrWho\u003e()\n", "\n", "// Get the average number of UK viewers (in millions) per director for Season 1.\n", "// Note: despite its name, viewersByDoctor groups the rows by the `Directed by` column.\n", "let viewersByDoctor =\n", "    doctorWho.Tables.``Season 1 (1963-1964)``.Rows\n", "    |\u003e Seq.groupBy (fun row -\u003e row.``Directed by``)\n", "    |\u003e Seq.map (fun (director, rows) -\u003e\n", "        let averaged =\n", "            rows |\u003e Seq.averageBy (fun row -\u003e row.``UK viewers (millions)``)\n", "\n", "        director, averaged)\n", "    |\u003e Seq.toArray\n" ] } , { "cell_type": "markdown", "metadata": {}, "source": [ "## Related articles\n", "\n", "* [HTML Parser](HtmlParser.html) - provides more information about\n", "working with HTML documents dynamically.\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": ".NET (F#)", "language": "F#", "name": ".net-fsharp" }, "language_info": { 
"file_extension": ".fs", "mimetype": "text/x-fsharp", "name": "polyglot-notebook", "pygments_lexer": "fsharp" }, "polyglot_notebook": { "kernelInfo": { "defaultKernelName": "fsharp", "items": [ { "aliases": [], "languageName": "fsharp", "name": "fsharp" } ] } } }, "nbformat": 4, "nbformat_minor": 2 }