Start
March 3, 2020 - 12:30 pm
End
March 3, 2020 - 1:30 pm
Address
OnTechU, North Oshawa campus, UA 3230
Speaker: Dr. Dhavide Aruliah (Quansight LLC)
Abstract: Data engineers and data scientists have benefited significantly from the proliferation of open-source packages supporting exploratory data analysis and data visualization in languages like Python and R. These tools permit practitioners to rapidly develop sophisticated models and algorithms through interactive data exploration. At the same time, storage costs have dropped and rates of data accumulation have increased, so practitioners typically have to master highly specialized tools to manage ever-larger data sets, especially when moving models into production. This often involves patching together frameworks with distinct interfaces, creating many debugging and maintenance challenges.
Following some historical context, we’ll present Dask as a simple Python package to help with moving data science into production. Dask uses familiar Pythonic interfaces (notably NumPy and Pandas) to provide simple API extensions that work easily with out-of-core data sets and that capitalize on available parallelism. We present a straightforward introduction to using Dask, with examples that illustrate how existing code manipulating NumPy arrays and Pandas DataFrames in memory can be extended to work cleanly on massive data sets.
Familiarity with the Python data science stack (e.g., NumPy and Pandas) will be useful but is not mandatory.
