Data Research Engineer Intern

Intern: Data Research and Engineering

  • Help build a high‑integrity dataset by aggregating and cross‑referencing records from multiple sources to enable analytics and outreach workflows at scale. 

Mission

  • Support data initiatives by assembling clean, linked metadata across entities, contributors, and organizations to power downstream insights and operations. 

Responsibilities

  • Collect and normalize data from public records, APIs, and permissible web sources, following terms of use, robots directives, and rate limits. 
  • Write Python scripts to parse semi‑structured inputs (HTML, JSON, CSV), handle pagination, and extract linked entities across related datasets (a minimal collection sketch follows this list).
  • Implement entity resolution and de‑duplication to reconcile title variations, name variants, organizational name changes, and historical records (see the name‑reconciliation sketch after this list).
  • Design and run data quality checks (uniqueness, referential integrity, coverage), maintain lineage notes, and log exceptions for human review (see the quality‑check sketch after this list).
  • Document schemas, cleaning rules, and edge cases; produce clear READMEs and handoff notes to ensure reproducibility. 
  • Collaborate with product and operations to align required fields, status flags, and SLAs with downstream workflows and notifications. 
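
To give a concrete flavor of the collection work, here is a minimal sketch of polite, paginated collection against a page‑numbered API. The endpoint, page parameter, result field, and one‑second delay are illustrative assumptions only; real sources vary in pagination style and published rate limits.

```python
import time
import requests

BASE_URL = "https://example.org/api/records"   # hypothetical endpoint
REQUEST_DELAY_S = 1.0                          # stay well under the source's published rate limit
HEADERS = {"User-Agent": "data-research-intern-bot (contact@example.org)"}

def fetch_all_pages(max_pages=100):
    """Walk a page-numbered API and return the raw record dicts."""
    records = []
    for page in range(1, max_pages + 1):
        resp = requests.get(BASE_URL, params={"page": page}, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        items = resp.json().get("results", [])
        if not items:                  # an empty page signals the end of the listing
            break
        records.extend(items)
        time.sleep(REQUEST_DELAY_S)    # simple rate limiting between requests
    return records

if __name__ == "__main__":
    rows = fetch_all_pages(max_pages=3)
    print(f"collected {len(rows)} records")
```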
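
Entity resolution in practice often starts with name normalization and string similarity. The sketch below uses only the standard library's difflib; the normalization rules and the 0.9 similarity threshold are illustrative assumptions, not a fixed policy.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Collapse casing, punctuation, and common suffixes before comparison (illustrative rules)."""
    name = name.lower().strip().replace(".", "").replace(",", "")
    for suffix in (" inc", " llc", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return " ".join(name.split())

def similarity(a: str, b: str) -> float:
    """String-similarity score in [0, 1] on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def deduplicate(names, threshold=0.9):
    """Greedy de-duplication: keep a name only if it is not too similar to one already kept."""
    kept = []
    for name in names:
        if all(similarity(name, existing) < threshold for existing in kept):
            kept.append(name)
    return kept

print(deduplicate(["Acme Inc.", "ACME, Inc", "Acme Holdings LLC"]))
# -> ['Acme Inc.', 'Acme Holdings LLC']
```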
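
Quality checks of the kind described above can be expressed compactly with pandas. The tables and column names (contributor_id, org_id) are hypothetical.

```python
import pandas as pd

# Hypothetical frames: contributors linked to organizations.
contributors = pd.DataFrame({
    "contributor_id": [1, 2, 2, 3],
    "org_id": [10, 10, 10, 99],
    "email": ["a@x.org", "b@x.org", "b@x.org", None],
})
organizations = pd.DataFrame({"org_id": [10, 11], "name": ["Acme", "Globex"]})

# Uniqueness: primary keys should not repeat.
duplicate_ids = contributors[contributors.duplicated("contributor_id", keep=False)]

# Referential integrity: every org_id should exist in the organizations table.
orphans = contributors[~contributors["org_id"].isin(organizations["org_id"])]

# Coverage: share of rows with a non-null value, per column.
coverage = contributors.notna().mean()

print(f"{len(duplicate_ids)} duplicate key rows, {len(orphans)} orphan rows")
print(coverage)
```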

Core skills

  • Proficiency in Python (Pandas, Requests, BeautifulSoup/Scrapy or comparable tools) and comfort with responsible API and web data collection. 
  • Strong SQL for joins, window functions, de‑duplication, and constraints; experience with cloud databases or notebooks is a plus (a window‑function example follows this list).
  • Experience with fuzzy matching and record linkage (string similarity, tokenization, basic ML heuristics) for entity resolution and confidence scoring. 
  • Version control (Git), clear documentation, and disciplined testing of data transforms and pipelines. 
  • High attention to detail with the ability to spot anomalies in names, dates, identifiers, and cross‑source inconsistencies. 
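
As an illustration of window‑function de‑duplication, the sketch below keeps only the most recent row per entity, run through Python's built‑in sqlite3 (it assumes a SQLite build recent enough to support window functions); the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (record_id INTEGER, entity_name TEXT, updated_at TEXT);
    INSERT INTO records VALUES
        (1, 'Acme',   '2024-01-01'),
        (2, 'Acme',   '2024-06-01'),
        (3, 'Globex', '2024-03-01');
    """
)

# Rank rows within each entity by recency, then keep only the top-ranked row.
query = """
WITH ranked AS (
    SELECT record_id, entity_name, updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY entity_name
               ORDER BY updated_at DESC
           ) AS rn
    FROM records
)
SELECT record_id, entity_name, updated_at FROM ranked WHERE rn = 1;
"""

for row in conn.execute(query):
    print(row)  # the 2024-06-01 'Acme' row and the single 'Globex' row survive
```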

Nice to have

  • Exposure to orchestration and scheduling (Airflow, Prefect, Cron) to keep pipelines current and auditable. 
  • Basic dashboarding or reporting to track QA status, coverage, and pipeline health. 
  • Background in statistics or machine learning for linkage, anomaly detection, and threshold tuning. 

How we work (confidentiality first)

  • Operate under explicit do/don’t guidelines for data access and usage; NDAs and information‑security training are required prior to system access. 
  • Avoid hard‑coding secrets; use configuration management and environment variables, and maintain parameterized, reproducible jobs (see the configuration sketch after this list).
  • Treat internal algorithms, rules compilations, pipelines, and vendor relationships as confidential and not for redistribution. 
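
As a small illustration of that configuration guideline, the sketch below reads credentials from environment variables rather than the codebase; the variable names are placeholders, and a real project might layer a config file or secrets manager on top.

```python
import os

# Read credentials from the environment instead of hard-coding them in the script.
API_TOKEN = os.environ["SOURCE_API_TOKEN"]  # required: fail fast if it is missing
DB_URL = os.environ.get("WAREHOUSE_DB_URL", "sqlite:///local.db")  # optional, with a safe default

def build_headers():
    """Attach the token to outgoing requests without it ever appearing in the repository."""
    return {"Authorization": f"Bearer {API_TOKEN}"}
```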

Eligibility and logistics

  • Advanced undergraduates or graduate students in Computer Science, Data Science, Information Systems, or related fields; prior internship or capstone experience is a plus. 
  • Part‑time during term with option for full‑time during breaks; hybrid/remote options based on project phase and data‑access constraints. 
  • Credit or paid internship per university policies; include availability and academic requirements with your application.