Data Research Engineer Intern
- Help build a high‑integrity dataset by aggregating and cross‑referencing records from multiple sources to enable analytics and outreach workflows at scale.
Mission
- Support data initiatives by assembling clean, linked metadata across entities, contributors, and organizations to power downstream insights and operations.
Responsibilities
- Collect and normalize data from public records, APIs, and permissible web sources, following terms of use, robots directives, and rate limits.
- Write Python scripts to parse semi‑structured inputs (HTML, JSON, CSV), handle pagination, and extract linked entities across related datasets (see the collection sketch after this list).
- Implement entity resolution and de‑duplication to reconcile title variations, name variants, organizational name changes, and historical records (matching sketch below).
- Design and run data quality checks for uniqueness, referential integrity, and coverage; maintain lineage notes and log exceptions for human review (QA sketch below).
- Document schemas, cleaning rules, and edge cases; produce clear READMEs and handoff notes to ensure reproducibility.
- Collaborate with product and operations to align required fields, status flags, and SLAs with downstream workflows and notifications.
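For context on the collection and parsing bullets above, here is a minimal sketch of a polite, paginated fetch with Requests and BeautifulSoup. The URL, query parameter, and CSS selectors are placeholders, not a real source; any real target would be checked against its terms of use and robots.txt first.

```python
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/records"  # placeholder source; check terms of use and robots.txt first
REQUEST_DELAY_S = 2.0                      # simple rate limit between requests

def fetch_records(max_pages=5):
    """Yield (title, href) pairs from a paginated listing; the markup selectors are hypothetical."""
    session = requests.Session()
    session.headers.update({"User-Agent": "data-research-intern-bot/0.1 (contact@example.org)"})
    for page in range(1, max_pages + 1):
        resp = session.get(BASE_URL, params={"page": page}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        rows = soup.select("div.record")   # hypothetical markup
        if not rows:                        # stop when pagination runs out
            break
        for row in rows:
            link = row.select_one("a")
            if link is not None:
                yield link.get_text(strip=True), link.get("href")
        time.sleep(REQUEST_DELAY_S)        # stay within the source's rate limits

if __name__ == "__main__":
    for title, href in fetch_records(max_pages=2):
        print(title, href)
```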
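For the entity-resolution bullet, a minimal sketch of name normalization and similarity-based clustering using only the standard library; the sample names and the 0.7 threshold are illustrative, and a real pipeline might use a dedicated record-linkage library instead.

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing names."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link_records(candidates, threshold=0.9):
    """Group names whose similarity to a cluster's first member clears the threshold (naive O(n^2) pass)."""
    clusters = []
    for name in candidates:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Illustrative data only; real inputs would come from the aggregated sources.
print(link_records(["Acme Corp.", "ACME Corporation", "Acme  corp", "Beta LLC"], threshold=0.7))
```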
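And for the data-quality bullet, a sketch of uniqueness, referential-integrity, and coverage checks with pandas; the column names (record_id, org_id) are assumptions for illustration.

```python
import pandas as pd

# Illustrative frames; in practice these would be loaded from the pipeline's outputs.
records = pd.DataFrame({"record_id": [1, 2, 2, 4], "org_id": [10, 11, 11, 99], "name": ["A", "B", "B", None]})
orgs = pd.DataFrame({"org_id": [10, 11, 12]})

issues = {}

# Uniqueness: the primary key should not repeat.
dupes = records[records.duplicated("record_id", keep=False)]
issues["duplicate_record_id"] = dupes["record_id"].tolist()

# Referential integrity: every org_id should exist in the orgs table.
orphans = records[~records["org_id"].isin(orgs["org_id"])]
issues["orphan_org_id"] = orphans["org_id"].tolist()

# Coverage: share of non-null values per column.
issues["coverage"] = records.notna().mean().round(2).to_dict()

# Log exceptions for human review rather than silently dropping rows.
print(issues)
```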
Core skills
- Proficiency in Python (Pandas, Requests, BeautifulSoup/Scrapy or comparable tools) and comfort with responsible API and web data collection.
- Strong SQL for joins, window functions, de‑duplication, and constraints; experience with cloud databases or notebooks is a plus (see the SQL sketch after this list).
- Experience with fuzzy matching and record linkage (string similarity, tokenization, basic ML heuristics) for entity resolution and confidence scoring.
- Version control (Git), clear documentation, and disciplined testing of data transforms and pipelines (test sketch below).
- High attention to detail with the ability to spot anomalies in names, dates, identifiers, and cross‑source inconsistencies.
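As a sketch of the SQL de‑duplication pattern above, keeping the most recent row per key with ROW_NUMBER(); it runs here on SQLite (3.25+ supports window functions) purely for self-containment, and the table and column names are illustrative. The same pattern applies in a cloud warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contributors (id INTEGER, name TEXT, updated_at TEXT);
INSERT INTO contributors VALUES
  (1, 'Jane Doe',  '2024-01-01'),
  (1, 'Jane  Doe', '2024-03-01'),
  (2, 'R. Smith',  '2024-02-15');
""")

# Keep only the most recent row per id using ROW_NUMBER() over a per-id window.
query = """
SELECT id, name, updated_at
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM contributors
) AS ranked
WHERE rn = 1;
"""
for row in conn.execute(query):
    print(row)
```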
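And for disciplined testing of transforms, a minimal unit test for a hypothetical date-normalization function; saved as test_cleaning.py, pytest would collect and run it.

```python
# test_cleaning.py -- a minimal test for a hypothetical date-normalization transform.
from datetime import date, datetime

def normalize_date(value: str) -> date:
    """Parse a handful of source formats into one canonical date."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

def test_normalize_date_handles_common_formats():
    assert normalize_date("2024-05-01") == date(2024, 5, 1)
    assert normalize_date("05/01/2024") == date(2024, 5, 1)
    assert normalize_date(" 1 May 2024 ") == date(2024, 5, 1)
```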
Nice to have
- Exposure to orchestration and scheduling (Airflow, Prefect, cron) to keep pipelines current and auditable (see the scheduling sketch after this list).
- Basic dashboarding or reporting to track QA status, coverage, and pipeline health.
- Background in statistics or machine learning for linkage, anomaly detection, and threshold tuning.
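To illustrate the scheduling bullet, a minimal sketch of a daily refresh as an Airflow DAG, assuming Airflow 2.4+; the dag_id, task, and schedule are placeholders, and Prefect or cron would serve the same purpose.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def refresh_dataset():
    """Placeholder for the collection, resolution, and QA steps described above."""
    print("refreshing dataset")

# A single daily task; a real pipeline would chain extract -> resolve -> QA tasks.
with DAG(
    dag_id="dataset_refresh",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="refresh", python_callable=refresh_dataset)
```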
How we work (confidentiality first)
- Operate under explicit do/don’t guidelines for data access and usage; NDAs and information‑security training are required prior to system access.
- Avoid hard‑coding secrets; use configuration management and environment variables, and maintain parameterized, reproducible jobs (see the sketch after this list).
- Treat internal algorithms, rules compilations, pipelines, and vendor relationships as confidential and not for redistribution.
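As a sketch of the no-hard-coded-secrets rule, reading credentials from environment variables; the variable names here are hypothetical.

```python
import os

# Read credentials from the environment (set via a .env loader, CI secrets, or the scheduler's
# secret store); never commit them to the repository.
API_TOKEN = os.environ["SOURCE_API_TOKEN"]  # hypothetical variable name; fails fast if unset
DB_URL = os.environ.get("WAREHOUSE_DB_URL", "sqlite:///local_dev.db")  # optional, with a dev default

def make_headers():
    """Build request headers without ever hard-coding the token in source."""
    return {"Authorization": f"Bearer {API_TOKEN}"}
```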
Eligibility and logistics
- Advanced undergraduates or graduate students in Computer Science, Data Science, Information Systems, or related fields; prior internship or capstone experience is a plus.
- Part‑time during the term, with the option of full‑time during breaks; hybrid/remote arrangements depend on project phase and data‑access constraints.
- Offered for academic credit or as a paid internship, per university policies; include your availability and any academic requirements with your application.