Practical data analysis cookbook
- Author: Tomasz Drabas
- Publisher: Packt Publishing Ltd.
- Year of publication: 2016
- ISBN: 9781783551668; 9781783558513
- Format: EPUB (reflowable), PDF, JPG
Data analysis is the process of systematically applying statistical and logical techniques to describe and illustrate, condense and recap, and evaluate data. Its importance has been most visible in the information and communication technology sector, and it is a valuable skill for employees in almost every sector of the economy.
This book provides a rich set of independent recipes that dive into the world of data analytics and modeling using a variety of approaches, tools, and algorithms. You will learn the basics of data handling and modeling, and will build your skills gradually toward more advanced topics such as simulations, raw text processing, social interactions analysis, and more.
First, you will learn some easy-to-follow practical techniques for reading, writing, cleaning, reformatting, exploring, and understanding your data, arguably the most time-consuming (and the most important) tasks for any data scientist.
In the second section, different independent recipes delve into intermediate topics such as classification, clustering, predicting, and more. With the help of these easy-to-follow recipes, you will also learn techniques that can easily be expanded to solve other real-life problems such as building recommendation engines or predictive models.
In the third section, you will explore more advanced topics, from graph theory through natural language processing and discrete choice modeling to simulations. You will also learn to identify the origin of fraud with the help of a graph, scrape websites, and classify movies based on their reviews.
By the end of this book, you will be able to efficiently use the vast array of tools that the Python environment has to offer.
- Preface
- Chapter 1 : Preparing the Data
- Introduction
- Reading and writing CSV / TSV files with Python
- Reading and writing JSON files with Python
- Reading and writing Excel files with Python
- Reading and writing XML files with Python
- Retrieving HTML pages with pandas
- Storing and retrieving from a relational database
- Storing and retrieving from MongoDB
- Opening and transforming data with OpenRefine
- Exploring the data with OpenRefine
- Removing duplicates
- Using regular expressions and GREL to clean up data
- Imputing missing observations
- Normalizing and standardizing the features
- Binning the observations
- Encoding categorical variables
- Chapter 2 : Exploring the Data
- Introduction
- Producing descriptive statistics
- Exploring correlations between features
- Visualizing the interactions between features
- Producing histograms
- Creating multivariate charts
- Sampling the data
- Splitting the dataset into training, cross-validation, and testing
- Chapter 3 : Classification Techniques
- Introduction
- Testing and comparing the models
- Classifying with Naïve Bayes
- Using logistic regression as a universal classifier
- Utilizing Support Vector Machines as a classification engine
- Classifying calls with decision trees
- Predicting subscribers with random tree forests
- Employing neural networks to classify calls
- Chapter 4 : Clustering Techniques
- Introduction
- Assessing the performance of a clustering method
- Clustering data with k-means algorithm
- Finding an optimal number of clusters for k-means
- Discovering clusters with mean shift clustering model
- Building fuzzy clustering model with c-means
- Using hierarchical model to cluster your data
- Finding groups of potential subscribers with DBSCAN and BIRCH algorithms
- Chapter 5 : Reducing Dimensions
- Introduction
- Creating three-dimensional scatter plots to present principal components
- Reducing the dimensions using the kernel version of PCA
- Using Principal Component Analysis to find things that matter
- Finding the principal components in your data using randomized PCA
- Extracting the useful dimensions using Linear Discriminant Analysis
- Using various dimension reduction techniques to classify calls using the k-Nearest Neighbors classification model
- Chapter 6 : Regression Methods
- Introduction
- Identifying and tackling multicollinearity
- Building Linear Regression model
- Using OLS to forecast how much electricity can be produced
- Estimating the output of an electric plant using CART
- Employing the kNN model in a regression problem
- Applying the Random Forest model to a regression analysis
- Gauging the amount of electricity a plant can produce using SVMs
- Training a Neural Network to predict the output of a power plant
- Chapter 7 : Time Series Techniques
- Introduction
- Handling date objects in Python
- Understanding time series data
- Smoothing and transforming the observations
- Filtering the time series data
- Removing trend and seasonality
- Forecasting the future with ARMA and ARIMA models
- Chapter 8 : Graphs
- Introduction
- Handling graph objects in Python with NetworkX
- Using Gephi to visualize graphs
- Identifying people whose credit card details were stolen
- Identifying those responsible for stealing the credit cards
- Chapter 9 : Natural Language Processing
- Introduction
- Reading raw text from the Web
- Tokenizing and normalizing text
- Identifying parts of speech, handling n-grams, and recognizing named entities
- Identifying the topic of an article
- Identifying the sentence structure
- Classifying movies based on their reviews
- Chapter 10 : Discrete Choice Models
- Introduction
- Preparing a dataset to estimate discrete choice models
- Estimating the well-known Multinomial Logit model
- Testing for violations of the Independence from Irrelevant Alternatives
- Handling IIA violations with the Nested Logit model
- Managing sophisticated substitution patterns with the Mixed Logit model
- Chapter 11 : Simulations
- Introduction
- Using SimPy to simulate the refueling process of a gas station
- Simulating out-of-energy occurrences for an electric car
- Determining if a population of sheep is in danger of extinction due to a wolf pack
- Index