(3-month internship)
Project Overview
If you are interested in data quality and data collection, this internship could be a good fit for you.
The goal is to find reliable websites and scrape information from them. If you are already familiar with web scraping, you can use the programming language you prefer; otherwise, Python is recommended. The recommended libraries are, in particular, requests, polars/pandas, re, and BeautifulSoup.
Depending on the time available, one or more datasets will be required. The topic, the data sources, and the type of database will be chosen together. The data to collect is purely textual. The final datasets should be well structured, reliable, and clear.
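As a rough illustration of how the recommended libraries can fit together, here is a minimal sketch. It assumes a hypothetical page layout in which each item sits in an <article> element with an <h2> title and a <p> body; the sample HTML, the URL, and the function names (fetch_html, clean_text, extract_records) are invented for the example and would need adapting to the sites actually chosen.

```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML; raises on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def clean_text(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def extract_records(html: str, source: str) -> pd.DataFrame:
    """Turn each <article> element (assumed layout) into one table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("article"):
        title = article.find("h2")
        body = article.find("p")
        rows.append({
            "source": source,
            "title": clean_text(title.get_text()) if title else "",
            "text": clean_text(body.get_text()) if body else "",
        })
    return pd.DataFrame(rows, columns=["source", "title", "text"])


# Hard-coded sample so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <article><h2>  First   title </h2><p>Some
     text here.</p></article>
  <article><h2>Second title</h2><p>More text.</p></article>
</body></html>
"""

df = extract_records(SAMPLE_HTML, source="https://example.org/sample")
print(df)
```

For a real page, the HTML would come from fetch_html(url) instead of the sample string, and the result could be saved with df.to_csv("dataset.csv", index=False).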
Educational objectives
- Learn how to collect data from websites using scraping techniques.
- Find out which privacy rules cover such public information and act accordingly.
- Learn how to manage large amounts of information and guarantee their quality.
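As one possible starting point for the data quality objective, the sketch below counts empty and duplicated rows in a scraped table and then removes them, using pandas. The column names, the sample data, and the helper names (quality_report, clean_dataset) are assumptions made for illustration; real checks would depend on the chosen topic and sources.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, text_column: str = "text") -> dict:
    """Count total rows, rows with empty text, and fully duplicated rows."""
    text = df[text_column].fillna("").str.strip()
    return {
        "rows": len(df),
        "empty_text": int((text == "").sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def clean_dataset(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Trim the text column, then drop rows that are empty or duplicated."""
    cleaned = df.copy()
    cleaned[text_column] = cleaned[text_column].fillna("").str.strip()
    cleaned = cleaned[cleaned[text_column] != ""]
    return cleaned.drop_duplicates().reset_index(drop=True)


# Invented sample data: one duplicate, one blank entry, one missing entry.
raw = pd.DataFrame({
    "source": ["a", "a", "b", "c"],
    "text": ["hello", "hello", "  ", None],
})

report = quality_report(raw)
cleaned = clean_dataset(raw)
print(report)
print(cleaned)
```

Running the report before and after cleaning gives a simple, repeatable record of what was removed and why.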
Required Skills
- Basic Python programming skills
- Critical judgment to assess which websites can be trusted
- Ethical integrity
Expected Outcomes
- One or more datasets containing textual information (possibly in .csv or .xlsx format)
- A report listing the sites from which the information was taken and the data privacy laws learned about during the internship
Timeline
From 3 to 6 months depending on requirements
Contacts