[PageOneX] [dev] Further work in scraper script for Kiosko web

pablo rey pablo at basurama.org
Fri Mar 15 10:57:48 EDT 2013


Thanks Rafa, and welcome to the list.

We now have an expanded list with 690 newspapers (before there were 'just'
384). Almost doubled!

It's also worth mentioning that we have a new column in the kiosko.csv file
<https://github.com/numeroteca/pageonex/blob/master/public/kiosko.csv>
with the URL of each newspaper's website. As a first step we want to use it
to link to the newspaper's website while coding front pages. Apart from
linking to kiosko.net, it is good to cite the source of the images properly.
This is what kiosko.net does, and it might be a 'solution' to avoid
problems with ownership of the data.
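
A minimal sketch of how the new column could be read to build those links,
in Python just for illustration. The header names (`name`, `url`) and the
sample rows are assumptions, not the real layout of kiosko.csv:

```python
import csv
import io

# Hypothetical sample; the real kiosko.csv columns and headers may differ.
SAMPLE = """name,url
Example Daily,http://example.org/daily
Example Weekly,
"""

def source_links(csv_text, name_col="name", url_col="url"):
    """Return {newspaper name: website URL}, skipping rows without a URL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    links = {}
    for row in reader:
        url = (row.get(url_col) or "").strip()
        if url:
            links[row[name_col]] = url
    return links

print(source_links(SAMPLE))
```

Rows with an empty URL are skipped, so the front page view could fall back
to linking only to kiosko.net for those newspapers.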

After updating and populating the database with this new list of
newspapers, we'll have to take into account that there are more newspapers,
and in a different order (different media_id?), when merging previous
databases or threads (the ones on Heroku).
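
A rough sketch of one way to handle this, assuming each newspaper has some
stable key that survives between the old and the new list (here a
hypothetical `name` field). The field names and sample rows are made up for
illustration, and the example is in Python, not taken from the app:

```python
def build_id_remap(old_rows, new_rows, key="name"):
    """Map old media ids to new ones by matching rows on a stable key.

    Returns (remap, unmatched): remap is {old_id: new_id}; unmatched lists
    the keys of old rows that have no counterpart in the new list.
    """
    new_by_key = {row[key]: row["id"] for row in new_rows}
    remap = {}
    unmatched = []
    for row in old_rows:
        new_id = new_by_key.get(row[key])
        if new_id is None:
            unmatched.append(row[key])
        else:
            remap[row["id"]] = new_id
    return remap, unmatched

# Hypothetical data: same newspapers, different order, one added, one gone.
old = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 3, "name": "Gone"}]
new = [{"id": 1, "name": "B"}, {"id": 2, "name": "New"}, {"id": 3, "name": "A"}]

remap, unmatched = build_id_remap(old, new)
print(remap, unmatched)
```

Anything that ends up in `unmatched` would need a manual check before the
old threads are merged.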

best,
p




On Fri, Mar 15, 2013 at 10:43 AM, Rafael Porres Molina <rporres at gmail.com> wrote:

> Hi devs,
>
> First, let me introduce myself: I'm a friend of Pablo's, a Perl hacker
> and sysadmin. A while ago he told me that pageonex needed a list of
> all the newspapers in Kiosko (kiosko.csv), and I found a way of doing it. I
> don't know much Ruby, so I offered to write it in Perl. Since the
> list is not meant to be dynamic, we concluded that the language was not a
> problem.
>
> I've updated the script to get the newspaper URLs and to fetch more types
> of newspapers. Before, it only listed the general newspapers. Now I've
> included everything that I found Kiosko can offer, taking care to avoid
> duplicates.
>
> If you have any questions about how the script works, or you find any bugs,
> please let me know ;-)
>
> Regards,
>
> Rafa
>
> _______________________________________________
> Pageonexdev mailing list
> Pageonexdev at mit.edu
> http://mailman.mit.edu/mailman/listinfo/pageonexdev
>
>