HomePageX
==============

HomePageX is content analysis software for analyzing the home pages of online newspapers.

An R script counts how many news titles contain certain words across a set of archived home page files from online newspapers.

![Percentage of news about the Cifuentes scandal](http://numeroteca.org/wp-content/uploads/2018/06/porcentaje-noticias-portada-diarios-digitales-cifuentes-b.jpg "Percentage of home page news about the Cifuentes scandal")

Examples: [El escándalo del TFM de Cifuentes en las páginas de inicio](http://numeroteca.org/2018/06/08/escandalo-tfm-cifuentes-paginas-inicio-periodicos-digitales/) (6/2018, numeroteca.org) or [La cobertura de los partidos en periodo electoral](http://numeroteca.org/2019/06/24/cobertura-de-partidos-en-paginas-de-inicio-en-elecciones-generales-28a/) (6/2019, numeroteca.org).

# How to use HomePageX

## Get your copy of the home pages

Ask @numeroteca / http://numeroteca.org.

## Create the list of newspaper home pages

Run this command in a bash shell, from the directory where your downloaded files are stored:

`for f in *.gz; do echo "$f" >> mylist.txt; done`

The resulting `mylist.txt` file lists the names of all the .gz files that contain the HTML of the home pages.
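
Because the loop uses `>>`, it appends to `mylist.txt`, so running it twice duplicates every entry. As a minimal sketch of an alternative, the hypothetical helper `make_list` below (not part of HomePageX) overwrites the list on each run and stops if the directory holds no .gz files:

```shell
# Hypothetical helper, not part of HomePageX: write the names of all .gz
# files in a directory to mylist.txt, replacing any previous list.
make_list() {
  dir="${1:-.}"   # directory to scan; defaults to the current one
  (
    cd "$dir" || exit 1
    # Load the matching names into the positional parameters.
    set -- *.gz
    # Without matches the pattern stays literal, so "$1" would not exist.
    if [ ! -e "$1" ]; then
      echo "no .gz files found in $dir" >&2
      exit 1
    fi
    # ">" truncates mylist.txt, so repeated runs never duplicate entries.
    printf '%s\n' "$@" > mylist.txt
  )
}
```

Calling `make_list` with no argument rebuilds `mylist.txt` in the current directory; `make_list /path/to/files` does the same for another directory.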

## Create a data frame with the newspaper names, times and dates

Using `html-parser.R` with `mylist.txt` as input, create a `results.Rda` file that contains, for each home page:

+ number of titles on the home page
+ number of titles on the home page that contain the selected words
+ percentage of titles on the home page that contain the selected words.

You can load existing results files like `data/results-cifuentes-01.Rda`.
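
To make the three measures concrete, here is a rough shell sketch that computes them for a single archived page. It is not the logic of `html-parser.R`: the function name `headline_stats` is hypothetical, and it assumes (purely for illustration) that every title is an `<h2>` element on its own line. Real home pages need a proper HTML parser.

```shell
# Hypothetical sketch, not the logic of html-parser.R.
# Assumption: every title is an <h2> element on its own line.
headline_stats() {
  file="$1"   # gzip-compressed home page
  word="$2"   # word to look for (case-insensitive, fixed string)

  # Decompress and keep only the lines that look like titles.
  # grep exits non-zero when nothing matches, hence "|| true".
  headlines=$(gzip -dc "$file" | grep -i '<h2' || true)

  # Count all titles, then those whose line mentions the word
  # (the match is on the whole line, markup included).
  total=$(printf '%s\n' "$headlines" | grep -c . || true)
  matched=$(printf '%s\n' "$headlines" | grep -ciF -- "$word" || true)

  # Percentage of matching titles; NA when the page has no titles.
  pct=$(LC_ALL=C awk -v t="$total" -v m="$matched" \
    'BEGIN { if (t > 0) printf "%.1f", 100 * m / t; else printf "NA" }')

  # One tab-separated row: file, total, matched, percentage.
  printf '%s\t%s\t%s\t%s\n' "$file" "$total" "$matched" "$pct"
}
```

Under the same assumptions, `while IFS= read -r f; do headline_stats "$f" "cifuentes"; done < mylist.txt` would print one row per file listed in `mylist.txt`.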

## Plot visualizations based on the results

A series of visualizations lets you explore the results obtained.

# FAQ

## Where does the home page HTML come from?

We use the Storytracker script (http://storytracker.pastpages.org/en/latest/) to archive a list of newspaper home pages on our own server every hour. We only save the HTML of each page.

## Which newspapers are you storing?

### Spanish media

1. http://www.elpais.com  
1. http://www.elmundo.es
1. http://www.abc.es/
1. http://www.larazon.es/
1. http://www.lavanguardia.com/
1. http://www.elperiodico.com/es/
1. http://www.ara.cat/
1. http://www.eldiario.es
1. http://www.elespanol.com
1. http://www.publico.es/
1. http://www.20minutos.es/
1. http://www.huffingtonpost.es/
1. https://www.infolibre.es/
1. http://www.elconfidencial.com/
1. http://www.rtve.es/
1. http://cadenaser.com/
1. http://www.cope.es/
1. http://www.ondacero.es/
1. http://www.efe.com/
1. http://esradio.libertaddigital.com/
1. http://www.libertaddigital.com/
1. http://www.vozpopuli.com/
1. http://www.lavozdegalicia.es/
1. http://www.elcorreo.com/
1. http://www.ccma.cat/tv3/
1. http://www.telemadrid.es/
1. http://elprogreso.galiciae.com/
1. https://okdiario.com
1. https://www.lamarea.com/
1. https://www.elsaltodiario.com/
1. https://www.naiz.eus/hemeroteca/gara
1. https://www.berria.eus/
1. https://www.naiz.eus/
1. http://www.diariovasco.com/
1. http://www.deia.com/

### US and UK media

1. https://www.nytimes.com/
1. https://www.theguardian.com