What is the course about?
Data analysis requires clean, orderly data to work on. Sadly, most useful data comes from the real world, which is neither clean nor orderly. A great deal of an analyst’s time is spent bridging the gap between the two.
The process begins with getting the data into a form Python can work with. We look at various file formats (CSV, Excel, ARFF, Pickle, fixed-width, partially-structured, XML, HTML, JSON) and provide a crash course in databases and web APIs for the non-specialist. This exposes us to a variety of Python libraries. We also deal here with the problem of extracting data from a source that was never intended to be used that way, such as a web page or log file.
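As a small illustration of the idea of getting differently formatted data into one common Python form, here is a minimal sketch using only the standard library. The sample records, field names (`city`, `temp_c`) and helper functions are invented for this example and are not taken from the course materials.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two records in three formats.
CSV_TEXT = "city,temp_c\nLondon,11.5\nLeeds,9.0\n"
JSON_TEXT = '[{"city": "London", "temp_c": 11.5}, {"city": "Leeds", "temp_c": 9.0}]'
XML_TEXT = (
    "<readings>"
    '<reading city="London"><temp_c>11.5</temp_c></reading>'
    '<reading city="Leeds"><temp_c>9.0</temp_c></reading>'
    "</readings>"
)


def from_csv(text):
    # csv.DictReader yields every value as a string, so convert explicitly.
    return [
        {"city": row["city"], "temp_c": float(row["temp_c"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_json(text):
    # JSON already carries basic types, so numbers arrive as floats.
    return json.loads(text)


def from_xml(text):
    # Attributes and element text are strings in XML, so convert here too.
    root = ET.fromstring(text)
    return [
        {"city": node.get("city"), "temp_c": float(node.findtext("temp_c"))}
        for node in root.iter("reading")
    ]


records_csv = from_csv(CSV_TEXT)
records_json = from_json(JSON_TEXT)
records_xml = from_xml(XML_TEXT)
```

All three functions return the same list of dictionaries, so later cleaning steps need not care which format the data arrived in.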
But this is only the first step; it’s unlikely, at this point, that your data will be of much use to you. The second half of the course offers a menagerie of issues you may encounter and techniques for resolving them. We discuss detecting and fixing errors in your data, often using statistical methods; encoding data in a convenient way; and finally “sculpting” your data into a form that reflects how you will use it. These often raise subtle methodological questions and there is not always a simple right way to proceed.
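To give a flavour of detecting and fixing errors with a statistical method, the sketch below flags outliers using the modified z-score (based on the median and the median absolute deviation) and then fills both outliers and missing values with the median of the remaining data. The readings, the 3.5 threshold and the function names are assumptions made for this example. Median filling is only one of several possible repairs, and it may not be the right one for a given analysis.

```python
from statistics import median

# Hypothetical sensor readings: None marks a missing value, 250.0 is a likely typo.
readings = [20.1, 19.8, None, 20.4, 250.0, 20.0, 19.9]


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    present = [v for v in values if v is not None]
    centre = median(present)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(v - centre) for v in present)
    if mad == 0:
        return []  # no spread at all, so the score is undefined
    return [
        i
        for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - centre) / mad > threshold
    ]


def clean(values):
    """Replace missing values and flagged outliers with the median of the rest."""
    bad = set(flag_outliers(values))
    good = [v for i, v in enumerate(values) if v is not None and i not in bad]
    fill = median(good)
    return [fill if (v is None or i in bad) else v for i, v in enumerate(values)]


outlier_positions = flag_outliers(readings)
cleaned = clean(readings)
```

Dropping the affected observations, or interpolating from neighbouring values, would be equally defensible alternatives with different consequences for later analysis.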
This is a live online course. You will need:
- An internet connection. The classes work best with Chrome.
- A computer with microphone and camera.
We will contact you with joining instructions before your course starts.
What will we cover?
• Extracting data from a wide variety of file types
• Querying a relational database
• Advanced Pandas and NumPy techniques
• Detecting and handling feature-level problems such as data corruption, human error and misunderstandings, outliers and inliers, and missing or redundant data
• Detecting and handling observation-level problems such as inconsistency, duplication and sample bias
• Using regression, interpolation and filters to transform messy or incomplete data
• Working with different levels of numerical precision and identifying problems related to numerical datatypes
• Using Pandas to change the dimensionality and granularity of your data
• Creating and managing hierarchical indices in Pandas
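For readers unfamiliar with the last two topics in the list above, here is a minimal sketch of what changing granularity, changing dimensionality and working with a hierarchical index can look like in Pandas. The sales table and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["North", "North", "North", "South", "South"],
        "store": ["A", "A", "B", "C", "C"],
        "units": [3, 5, 2, 7, 1],
    }
)

# Coarser granularity: one row per (region, store).
# Grouping by two columns produces a hierarchical index (a MultiIndex).
per_store = sales.groupby(["region", "store"])["units"].sum()

# Coarser still: aggregate over the inner level, leaving one row per region.
per_region = per_store.groupby(level="region").sum()

# Change dimensionality: move the inner index level into columns.
# Combinations that never occur are filled with 0 rather than NaN.
wide = per_store.unstack("store", fill_value=0)
```

Here `per_store` is a Series indexed by region and store, `per_region` holds one total per region, and `wide` is a table with one row per region and one column per store.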
What will I achieve?
By the end of this course you should be able to...
• Import data into Python from a wide variety of common file formats
• Extract data from a database
• Use some advanced features from Pandas
• Describe and diagnose a wide variety of error types that real-life data can contain, and fix them
• Where multiple approaches to fixing a problem exist, describe the pros and cons of each and the implications for future analysis
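As an indication of what extracting data from a database involves, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its columns and its contents are hypothetical, and the course may well use a different database system.

```python
import sqlite3

# An in-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)

# Parameterised inserts: the ? placeholders keep data separate from SQL text.
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("Ada", 12.5), ("Ben", 7.0), ("Ada", 30.0)],
)

# Let the database do the aggregation, then fetch the result as Python tuples.
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
```

Each element of `rows` is a tuple of (customer, total spent), ready to be loaded into a list, dictionary or DataFrame.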
What level is the course and do I need any particular skills?
This is an advanced course. You should already be confident with basic Python programming at the level of our Introduction to Python course, and you should have used the basic features of NumPy and Pandas; Introduction to Data Analytics with Python is ideal preparation for this.
You do not need any prior knowledge of statistics or analytical methods.
How will I be taught, and will there be any work outside the class?
There will be some theoretical underpinning, but the course is nearly all practical, taught through demonstrations, programming exercises and problem-solving activities.
Are there any other costs? Is there anything I need to bring?
There are no additional costs. You may wish to bring a pen and paper to take notes.
When I've finished, what course can I do next?
You might want to explore the Excel courses in data analysis, such as Data analysis with Power BI, Excel analysing data (stage 1 & 2) or Introduction to DAX: data analysis expression for Power BI. You might also find it beneficial to attend one of our maths courses in Probability and statistics for Data Analysis.