Commit 84f1314e authored by numeroteca

creates README, reorder/clean script, add example results data

parent e1258b8d
Homenewscounter
==============
An R script that counts, hour by hour, how many titles on the archived home pages of online newspapers contain certain words.
# How to use Homenewscounter
## Get your copy of the home pages
Ask @numeroteca (http://numeroteca.org) for a copy.
## Create the list of newspaper home pages
Run this in the directory where your downloaded files are stored:
`for f in *.gz; do echo "$f" >> mylist.txt; done`
`mylist.txt` then lists the names of all the .gz files that contain the HTML of the home pages.
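Because the one-liner appends with `>>`, running it twice would list every file twice. A minimal sketch of a rerunnable variant follows; the function name `build_list` and the demo file names are made up for illustration and are not part of this repository.

```shell
#!/bin/sh
# Write the names of all .gz files in a directory to mylist.txt, one per line.
# The list is emptied first, so rerunning does not create duplicate entries.
build_list() {
    dir="${1:-.}"
    (
        cd "$dir" || exit 1
        # Start from an empty list, then add each file name.
        : > mylist.txt
        for f in *.gz; do
            # Skip the unexpanded pattern when no .gz files exist.
            [ -e "$f" ] || continue
            printf '%s\n' "$f" >> mylist.txt
        done
    )
}

# Demo with throwaway files (hypothetical names, for illustration only).
demo_dir=$(mktemp -d)
touch "$demo_dir/elpais-2018-04-01.gz" "$demo_dir/abc-2018-04-01.gz"
build_list "$demo_dir"
build_list "$demo_dir"   # a second run leaves the list unchanged
cat "$demo_dir/mylist.txt"
```

In your own setup, calling `build_list` with no argument writes `mylist.txt` in the current directory, as the original command does.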
## Create a data frame with the newspaper names, times and dates
`html-parser.R` reads `mylist.txt` and creates a `results.Rda` file with, for each home page:
+ the number of titles in the home page
+ the number of titles in the home page that contain the selected words
+ the percentage of titles that contain the selected words, out of the total in that home page.
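
The third value follows from the first two. As a rough illustration of the arithmetic only (the real computation happens inside `html-parser.R`, in R, and may differ), here is a shell sketch; the function name `title_percentage` and the counts are hypothetical.

```shell
#!/bin/sh
# Percentage of selected titles out of all titles on one home page.
# Arguments: number of selected titles, total number of titles.
title_percentage() {
    selected=$1
    total=$2
    # Avoid dividing by zero when a page has no titles at all.
    if [ "$total" -eq 0 ]; then
        echo "NA"
        return 0
    fi
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v s="$selected" -v t="$total" \
        'BEGIN { printf "%.1f\n", 100 * s / t }'
}

# Hypothetical counts: 3 matching titles out of 12 on the page.
title_percentage 3 12   # prints 25.0
```

Returning `NA` for an empty page is an assumption made here to mirror R's missing-value convention.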
You can also load an existing results file, such as `data/results-cifuentes-01.Rda`.
## Plot visualizations based on the results
A series of visualizations to view the results obtained.
# FAQ
## Where does the home page HTML come from?
We use Storytracker (http://storytracker.pastpages.org/en/latest/) to archive the home pages of a list of newspapers every hour.
## Which newspapers are you storing?
### Spanish media
1. http://www.elpais.com
1. http://www.elmundo.es
1. http://www.abc.es/
1. http://www.larazon.es/
1. http://www.lavanguardia.com/
1. http://www.elperiodico.com/es/
1. http://www.ara.cat/
1. http://www.eldiario.es
1. http://www.elespanol.com
1. http://www.publico.es/
1. http://www.20minutos.es/
1. http://www.huffingtonpost.es/
1. https://www.infolibre.es/
1. http://www.elconfidencial.com/
1. http://www.rtve.es/
1. http://cadenaser.com/
1. http://www.cope.es/
1. http://www.ondacero.es/
1. http://www.efe.com/
1. http://esradio.libertaddigital.com/
1. http://www.libertaddigital.com/
1. http://www.vozpopuli.com/
1. http://www.lavozdegalicia.es/
1. http://www.elcorreo.com/
1. http://www.ccma.cat/tv3/
1. http://www.telemadrid.es/
1. http://elprogreso.galiciae.com/
1. https://okdiario.com
1. https://www.lamarea.com/
1. https://www.elsaltodiario.com/
1. https://www.naiz.eus/hemeroteca/gara
1. https://www.berria.eus/
1. https://www.naiz.eus/
1. http://www.diariovasco.com/
1. http://www.deia.com/
### US and UK media
1. https://www.nytimes.com/
1. https://www.theguardian.com