Assignments

Assignment goals:

1. Build skills writing declarative SQL queries.
2. Build skills writing imperative SQL functions by implementing network algorithms: breadth-first search (finding connected components) and computing PageRank.
3. Build basic skills working with RDDs.
4. Implement kNN on Spark; also implement the TF-IDF statistic for PubMed abstracts.
5. Implement regularized logistic regression on Spark using clinical trial descriptions.
6. Implement a deep learning network using Keras.
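To give a flavor of the PageRank assignment, the computation can be sketched in plain Python as a power iteration. The toy graph, damping factor, and iteration count below are illustrative assumptions, not the assignment's actual specification (the assignment implements this inside SQL functions).

```python
# Sketch: PageRank by power iteration on a toy directed graph.
# graph: dict mapping each node to the list of nodes it links to.

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the teleport mass up front.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

Because rank mass is conserved each iteration, the scores always sum to one; node "c", which is linked to by both "a" and "b", ends up with more rank than "b".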

 

The course has six assignments. Each assignment should take between
five and 15 hours to complete, and the time required on a given
assignment can vary widely depending upon student ability. The first
two assignments assume that students have access to an installation of
the Postgres database system. The next three assignments assume that
students have access to an installation of Apache Spark, which is a
Big Data distributed programming platform. It is possible to
partially complete the Spark assignments on a laptop running Spark, but
for the biggest data sets, it is necessary to use a compute cluster.
Since most students do not have a compute cluster sitting around, at
Rice we direct students to use Amazon Web Services to run Spark in
distributed mode. A careful student can complete the assignments
using about $100 of computer time from Amazon (note that as a student,
it is possible to request free credits from Amazon, so that
effectively, the cost to the student is zero). The last assignment
requires use of TensorFlow, a standard deep learning tool. It is
possible to install TensorFlow on a laptop, but it is likely to be too
slow to complete the assignment easily. If so, a student can use a
desktop machine with a relatively powerful GPU card, or rent a machine
from Amazon Web Services.

Exercises

1. Theoretical exercise using Relational Calculus and Relational Algebra
2. Implementation of gradient descent and Newton's method using Python
3. EM: implementation of the Expectation-Maximization algorithm using Python
4. Outliers: implementation of outlier detection using kNN and Python
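To illustrate the Outliers exercise, a common kNN-based scheme scores each point by its distance to its k-th nearest neighbor: isolated points get large scores. The data set and choice of k below are illustrative assumptions, not the homework's actual specification.

```python
# Sketch: kNN outlier scoring. A point's outlier score is its
# distance to its k-th nearest neighbor among the other points.
import math

def knn_outlier_scores(points, k):
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbor
    return scores

# Four clustered points plus one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(data, k=2)
```

The isolated point (5.0, 5.0) receives by far the largest score, while the clustered points all score near zero.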

The course has four homeworks. Each homework is designed to require
between three and five hours to complete. The homeworks are short
Python programming assignments that are meant to reinforce one of the
key ideas from the class. For example, the EM (expectation
maximization) algorithm is covered in class; students are asked to
implement the EM algorithm to solve a simple maximum likelihood
problem as a homework.
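An EM homework of this flavor can be sketched for a two-component, one-dimensional Gaussian mixture. The data, initialization, and iteration count below are illustrative assumptions, not the actual homework.

```python
# Sketch: EM for a two-component 1-D Gaussian mixture.
import math

def em_gmm(xs, iterations=50):
    # Initialize the means far apart so the components can separate.
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                weights[k]
                * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return mu, var, weights

# Two well-separated clusters of points.
data = [0.0, 0.2, -0.1, 0.1, 9.9, 10.1, 10.0, 9.8]
mu, var, weights = em_gmm(data)
```

On this data the estimated means converge near the two cluster centers (about 0.05 and 9.95), and the mixture weights sum to one by construction.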

Readings/Research Summaries

Students read seminal papers on the systems and algorithms covered in class, as well as an additional research paper on the use of the system or algorithm as applied to a domain of personal interest. Students submit a 1-2 page summary (answering a provided list of questions) for each reading assignment.

The assigned papers are:

  1. EF Codd. A Relational Model of Data for Large Shared Data Banks. Commun ACM, 13(6) June 1970, pp. 377-387 link to paper
  2. L Page, S Brin, R Motwani, T Winograd. “The PageRank Citation Ranking: Bringing Order to the Web.” Stanford Digital Library Technologies, January 1998. link to paper
  3. J Dean, S Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters” Commun ACM 51(1) Jan 2008, pp. 107-113. link to paper
  4. AP Dempster, NM Laird, DB Rubin. “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society, Series B (Methodological). 1977:1-38. link to paper OR J Bilmes. “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models” (Tech. Rep. ICSI-TR-97-021). International Computer Science Institute, U.C. Berkeley. 1997. link to paper
  5. M Zaharia, M Chowdhury, MJ Franklin, S Shenker, I Stoica. “Spark: Cluster Computing with Working Sets” Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010. link to paper
  6. Bottou L. Large-scale machine learning with stochastic gradient descent.  Proceedings of COMPSTAT’2010: Springer; 2010. p. 177-86. link to paper
  7. M Abadi, A Agarwal, P Barham, E Brevdo et al. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems” arXiv:1603.04467, 2016. link to paper

The instructions, questions to answer, and grading rubric may be found here.

Additional Information

The following papers may be of value to students unfamiliar with reading research papers:

1. Keshav S. How to read a paper. ACM SIGCOMM Computer Communication Review. 2007;37(3):83-4. link to paper

2. Purugganan M, Hewitt J. How to read a scientific article. Rice University. 2004. link to paper