Parsing El Mundo: problems with closing html tag before the content (#2) · Issues · numeroteca / HomePageX

Parsing El Mundo: problems with closing html tag before the content

HomePageX needs to parse newspaper home pages to detect the news headlines (see the parser at https://code.montera34.com/numeroteca/homepagex/-/blob/master/html-parser.R).

The problem with the El Mundo page is that the closing tags </body></html> are inserted right after the header, before the actual content:

</header></body></html> <div class="percentage-bar-container">

If the </html> tag is removed, the parser works.

This is the parser in R:

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

pageelmundo <- read_html("../../data/storytracker/tmp/http!www.elmundo.es!!!!@2018-04-25T10:01:03.196714+00:00")
# Gets all the text in article titles. Headlines are in "h3 a", but parsing does not work!
# TODO: it is not working because there are closing html and body tags before the headlines start!
# If those tags are removed, the parsing works.
# The column is named "title" so that titles$title below refers to an existing column.
titles <- data.frame(title = pageelmundo %>% html_nodes("h2") %>% html_text())

# total number of articles with link
n_news <- nrow(titles)

# select news that contain a certain word (`word` is assumed to be defined earlier in the script)
select_news <- data.frame(title = titles[grepl(word, titles$title), ])

Each snapshot is stored in its own compressed file. I attach one for testing: http_www.elmundo.es_____2018-04-25T10_01_03.196714+00_00.gz

I am trying to find out whether that </html> can be removed with R, but I can't get it to work. Maybe it has to be done beforehand with a bash script. What would that look like, applied to all the elmundo files?
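
One possible starting point for the bash route, as a minimal untested-on-real-data sketch: decompress each file, drop the premature </body></html> with sed, and write a cleaned copy to a separate directory so the originals stay untouched. It assumes the tags always appear exactly as </header></body></html> on a single line, as in the excerpt above, and that the El Mundo files can be matched with the glob *elmundo*.gz. The function names and directory names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Remove the premature "</body></html>" that directly follows "</header>".
# Reads HTML on stdin and writes the cleaned HTML to stdout.
# Assumption: the three tags are adjacent, with no whitespace between them.
strip_early_close() {
  sed 's#</header></body></html>#</header>#'
}

# Clean every El Mundo snapshot found in src_dir and write the result,
# recompressed under the same file name, into dst_dir.
clean_elmundo_dir() {
  local src_dir="$1" dst_dir="$2" f
  mkdir -p "$dst_dir"
  for f in "$src_dir"/*elmundo*.gz; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    gzip -dc "$f" | strip_early_close | gzip -c > "$dst_dir/$(basename "$f")"
  done
}

# Usage (placeholder paths):
#   clean_elmundo_dir ./snapshots ./snapshots-clean
```

The cleaned copies would then be the input for read_html() in the R parser. The legitimate </body></html> at the end of the page is left alone because it is not preceded by </header>.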