So long lxml, and thanks for all the fish by FiloSottile · Pull Request #329 · ytdl-org/youtube-dl · GitHub

Conversation

FiloSottile
Collaborator

There was a lot of confusion over HTML and XML parsing:

  • _unescapeHTML did the same thing as htmlentity_transform, but via the undocumented HTMLParser.unescape
  • sometimes _unescapeHTML was used, and sometimes the harder-to-read htmlentity_transform syntax was used instead
  • we depended on lxml (not public domain; MIT-licensed) only to do what getElementById does in JavaScript
  • there was no consistent way to clean up HTML to make it human-readable
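To make the entity-unescaping point concrete, here is a minimal sketch (in modern Python 3, not the Python 2 of the original PR) of what an htmlentity_transform-style helper does; the function names and exact regex are assumptions for illustration, not the PR's actual code:

```python
import re
from html.entities import name2codepoint

def htmlentity_transform(entity):
    """Turn one entity body (the text between '&' and ';') into its
    unicode character.  Illustrative sketch, not youtube-dl's code."""
    # Named entities like 'amp', 'copy', 'nbsp'
    if entity in name2codepoint:
        return chr(name2codepoint[entity])
    # Numeric references like '#169' (decimal) or '#x2122' (hex)
    mobj = re.match(r'#(x?[0-9a-fA-F]+)', entity)
    if mobj:
        numstr = mobj.group(1)
        if numstr.startswith('x'):
            return chr(int(numstr[1:], 16))
        return chr(int(numstr, 10))
    # Unknown entity: leave it untouched in the output
    return '&%s;' % entity

def unescapeHTML(s):
    """Replace every &entity; in s using htmlentity_transform."""
    return re.sub(r'&([^;]+);',
                  lambda m: htmlentity_transform(m.group(1)), s)
```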

So here are my changes:

  • now we have get_element_by_id, which uses a custom HTMLParser to extract the HTML element with a specific id, so lxml is gone
  • _unescapeHTML now uses htmlentity_transform, and I applied (and tested) it across the code
  • clean_html does the tag stripping and newline handling needed to produce readable text from a blob of HTML

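The get_element_by_id idea can be sketched with the standard-library HTMLParser: track nesting depth once the target id is seen and collect everything until its closing tag. This is a simplified Python 3 illustration of the approach, not the PR's actual parser:

```python
from html.parser import HTMLParser

class IDParser(HTMLParser):
    """Collect the inner HTML of the first element with a given id.
    Simplified sketch; the real parser in the PR differs in detail."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target element
        self.found = False
        self.done = False
        self.result = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if not self.found:
            if dict(attrs).get('id') == self.target_id:
                self.found = True
                self.depth = 1
        else:
            self.depth += 1
            self.result.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.found and not self.done:
            self.depth -= 1
            if self.depth == 0:
                self.done = True
            else:
                self.result.append('</%s>' % tag)

    def handle_data(self, data):
        if self.found and not self.done:
            self.result.append(data)

def get_element_by_id(id, html):
    parser = IDParser(id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.result)
```

No lxml needed: the depth counter is what lets nested tags of the same name close correctly.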
Have a look at ./youtube-dl --get-description http://www.youtube.com/watch?v=BQ80NtYmktQ

Before (without lxml):

Il nuovo pezzo di Mattia Sterpi Deejay Seguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/...

Before (with lxml):

Il nuovo pezzo di Mattia Sterpi DeejaySeguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/DjSterpiCompra il singolo su iTunes:http://itunes.apple.com/it/album/heart-of-mine-single/id494530985?ls=1Pubblicato: 21/01/2012℗ 2012 Il Chiosco

Now:

Il nuovo pezzo di Mattia Sterpi Deejay

Seguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/DjSterpi

Compra il singolo su iTunes:
http://itunes.apple.com/it/album/heart-of-mine-single/id494530985?ls=1

Pubblicato: 21/01/2012
℗ 2012 Il Chiosco

rmunn commented Apr 14, 2012

This code will fail on descriptions that contain non-ASCII characters, such as this video whose description is mostly in Japanese:

rmunn@localhost:~/test$ ./youtube-dl -l -c --skip-download --write-info-json http://www.youtube.com/watch?v=9wMwplzBK-Y
[youtube] Setting language
[youtube] 9wMwplzBK-Y: Downloading video webpage
[youtube] 9wMwplzBK-Y: Downloading video info webpage
[youtube] 9wMwplzBK-Y: Extracting video information
Traceback (most recent call last):
  File "./youtube-dl", line 4760, in <module>
    main()
  File "./youtube-dl", line 4751, in main
    _real_main()
  File "./youtube-dl", line 4735, in _real_main
    retcode = fd.download(all_urls)
  File "./youtube-dl", line 964, in download
    ie.extract(url)
  File "./youtube-dl", line 1230, in extract
    return self._real_extract(url)
  File "./youtube-dl", line 1496, in _real_extract
    video_description = get_element_by_id("eow-description", video_webpage)
  File "./youtube-dl", line 256, in get_element_by_id
    parser.loads(html)
  File "./youtube-dl", line 210, in loads
    self.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 393, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

@FiloSottile
Collaborator Author

Thanks. Already caught and fixed in my master branch; backporting soon.
But I think we need a standardized way to decode downloaded pages
automatically.
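Such a standardized decode step might look like the following hypothetical helper (Python 3; the name decode_webpage and the fallback order are assumptions, not youtube-dl's API). The key point is that bytes must become text before they ever reach an HTML parser, which is exactly what the UnicodeDecodeError above was about:

```python
import re

def decode_webpage(raw, headers=None):
    """Decode downloaded page bytes to text before HTML parsing.
    Hypothetical sketch; not youtube-dl's actual implementation."""
    # Prefer the charset declared in the Content-Type header, if any.
    charset = None
    if headers:
        m = re.search(r'charset=([\w-]+)', headers.get('Content-Type', ''))
        if m:
            charset = m.group(1)
    # Try the declared charset, then UTF-8, then Latin-1 as a last resort.
    for enc in [charset, 'utf-8', 'latin-1']:
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode('utf-8', errors='replace')
```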

@phihag phihag merged commit 7a8501e into ytdl-org:master May 22, 2012
joedborg referenced this pull request in joedborg/youtube-dl Nov 17, 2020
[pull] master from ytdl-org:master