Data Research Engineer Intern
- Help build a high‑integrity dataset by aggregating and cross‑referencing records from multiple sources to enable analytics and outreach workflows at scale.
Mission
- Support data initiatives by assembling clean, linked metadata across entities, contributors, and organizations to power downstream insights and operations.
Responsibilities
- Collect and normalize data from public records, APIs, and permissible web sources, following terms of use, robots directives, and rate limits.
- Write Python scripts to parse semi‑structured inputs (HTML, JSON, CSV), handle pagination, and extract linked entities across related datasets (see the collection sketch after this list).
- Implement entity resolution and de‑duplication to reconcile title variations, name variants, organizational name changes, and historical records (matching sketch below).
- Design and run data quality checks for uniqueness, referential integrity, and coverage; maintain lineage notes and log exceptions for human review (QA sketch below).
- Document schemas, cleaning rules, and edge cases; produce clear READMEs and handoff notes to ensure reproducibility.
- Collaborate with product and operations to align required fields, status flags, and SLAs with downstream workflows and notifications.
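For context on the collection and parsing bullets above, here is a minimal sketch of a polite, paginated fetch with Requests and BeautifulSoup. The URL, query parameter, and CSS selectors are placeholders, not a real source; any real target would be checked against its terms of use and robots.txt first.

```python
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/records"  # placeholder source; check terms of use and robots.txt first
REQUEST_DELAY_S = 2.0                      # simple rate limit between requests

def fetch_records(max_pages=5):
    """Yield (title, href) pairs from a paginated listing; the markup selectors are hypothetical."""
    session = requests.Session()
    session.headers.update({"User-Agent": "data-research-intern-bot/0.1 (contact@example.org)"})
    for page in range(1, max_pages + 1):
        resp = session.get(BASE_URL, params={"page": page}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        rows = soup.select("div.record")   # hypothetical markup
        if not rows:                        # stop when pagination runs out
            break
        for row in rows:
            link = row.select_one("a")
            if link is not None:
                yield link.get_text(strip=True), link.get("href")
        time.sleep(REQUEST_DELAY_S)        # stay within the source's rate limits

if __name__ == "__main__":
    for title, href in fetch_records(max_pages=2):
        print(title, href)
```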
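For the entity-resolution bullet, a minimal sketch of name normalization and similarity-based clustering using only the standard library; the sample names and the 0.7 threshold are illustrative, and a real pipeline might use a dedicated record-linkage library instead.

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing names."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link_records(candidates, threshold=0.9):
    """Group names whose similarity to a cluster's first member clears the threshold (naive O(n^2) pass)."""
    clusters = []
    for name in candidates:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Illustrative data only; real inputs would come from the aggregated sources.
print(link_records(["Acme Corp.", "ACME Corporation", "Acme  corp", "Beta LLC"], threshold=0.7))
```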
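And for the data-quality bullet, a sketch of uniqueness, referential-integrity, and coverage checks with pandas; the column names (record_id, org_id) are assumptions for illustration.

```python
import pandas as pd

# Illustrative frames; in practice these would be loaded from the pipeline's outputs.
records = pd.DataFrame({"record_id": [1, 2, 2, 4], "org_id": [10, 11, 11, 99], "name": ["A", "B", "B", None]})
orgs = pd.DataFrame({"org_id": [10, 11, 12]})

issues = {}

# Uniqueness: the primary key should not repeat.
dupes = records[records.duplicated("record_id", keep=False)]
issues["duplicate_record_id"] = dupes["record_id"].tolist()

# Referential integrity: every org_id should exist in the orgs table.
orphans = records[~records["org_id"].isin(orgs["org_id"])]
issues["orphan_org_id"] = orphans["org_id"].tolist()

# Coverage: share of non-null values per column.
issues["coverage"] = records.notna().mean().round(2).to_dict()

# Log exceptions for human review rather than silently dropping rows.
print(issues)
```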
Core skills
- Proficiency in Python (Pandas, Requests, BeautifulSoup/Scrapy or comparable tools) and comfort with responsible API and web data collection.
- Strong SQL for joins, window functions, de‑duplication, and constraints; experience with cloud databases or notebooks is a plus (see the SQL sketch after this list).
- Experience with fuzzy matching and record linkage (string similarity, tokenization, basic ML heuristics) for entity resolution and confidence scoring.
- Version control (Git), clear documentation, and disciplined testing of data transforms and pipelines (test sketch below).
- High attention to detail with the ability to spot anomalies in names, dates, identifiers, and cross‑source inconsistencies.
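As a sketch of the SQL de‑duplication pattern above, keeping the most recent row per key with ROW_NUMBER(); it runs here on SQLite (3.25+ supports window functions) purely for self-containment, and the table and column names are illustrative. The same pattern applies in a cloud warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contributors (id INTEGER, name TEXT, updated_at TEXT);
INSERT INTO contributors VALUES
  (1, 'Jane Doe',  '2024-01-01'),
  (1, 'Jane  Doe', '2024-03-01'),
  (2, 'R. Smith',  '2024-02-15');
""")

# Keep only the most recent row per id using ROW_NUMBER() over a per-id window.
query = """
SELECT id, name, updated_at
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM contributors
) AS ranked
WHERE rn = 1;
"""
for row in conn.execute(query):
    print(row)
```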
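And for disciplined testing of transforms, a minimal unit test for a hypothetical date-normalization function; saved as test_cleaning.py, pytest would collect and run it.

```python
# test_cleaning.py -- a minimal test for a hypothetical date-normalization transform.
from datetime import date, datetime

def normalize_date(value: str) -> date:
    """Parse a handful of source formats into one canonical date."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

def test_normalize_date_handles_common_formats():
    assert normalize_date("2024-05-01") == date(2024, 5, 1)
    assert normalize_date("05/01/2024") == date(2024, 5, 1)
    assert normalize_date(" 1 May 2024 ") == date(2024, 5, 1)
```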
Nice to have
- Exposure to orchestration and scheduling (Airflow, Prefect, cron) to keep pipelines current and auditable (see the scheduling sketch after this list).
- Basic dashboarding or reporting to track QA status, coverage, and pipeline health.
- Background in statistics or machine learning for linkage, anomaly detection, and threshold tuning.
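To illustrate the scheduling bullet, a minimal sketch of a daily refresh as an Airflow DAG, assuming Airflow 2.4+; the dag_id, task, and schedule are placeholders, and Prefect or cron would serve the same purpose.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def refresh_dataset():
    """Placeholder for the collection, resolution, and QA steps described above."""
    print("refreshing dataset")

# A single daily task; a real pipeline would chain extract -> resolve -> QA tasks.
with DAG(
    dag_id="dataset_refresh",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="refresh", python_callable=refresh_dataset)
```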
How we work (confidentiality first)
- Operate under explicit do/don’t guidelines for data access and usage; NDAs and information‑security training are required prior to system access.
- Avoid hard‑coding secrets; use configuration management and environment variables, and maintain parameterized, reproducible jobs (see the sketch after this list).
- Treat internal algorithms, rules compilations, pipelines, and vendor relationships as confidential and not for redistribution.
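As a sketch of the no-hard-coded-secrets rule, reading credentials from environment variables; the variable names here are hypothetical.

```python
import os

# Read credentials from the environment (set via a .env loader, CI secrets, or the scheduler's
# secret store); never commit them to the repository.
API_TOKEN = os.environ["SOURCE_API_TOKEN"]  # hypothetical variable name; fails fast if unset
DB_URL = os.environ.get("WAREHOUSE_DB_URL", "sqlite:///local_dev.db")  # optional, with a dev default

def make_headers():
    """Build request headers without ever hard-coding the token in source."""
    return {"Authorization": f"Bearer {API_TOKEN}"}
```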
Eligibility and logistics
- Advanced undergraduates or graduate students in Computer Science, Data Science, Information Systems, or related fields; prior internship or capstone experience is a plus.
- Part‑time during the term, with the option of full‑time during breaks; hybrid/remote arrangements depend on project phase and data‑access constraints.
- Offered for academic credit or as a paid internship, per university policies; include your availability and any academic requirements with your application.