Summary of Kaggle 'Pandas' Course (1) - Lessons 1–3
Summarizes practical ways to use the Pandas library for data cleaning and wrangling, based on Kaggle’s free 'Pandas' course with added notes. This post covers Lessons 1–3.
I summarize here what I studied through Kaggle’s Pandas course.
Since it’s fairly long, I split it into two parts.
- Part 1: Lessons 1–3 (this post)
- Part 2: Lessons 4–6
Lesson 1. Creating, Reading and Writing
Importing pandas
1
import pandas as pd
Pandas has two core objects: the DataFrame and the Series.
DataFrame
A DataFrame can be thought of as a table or a matrix. It consists of a matrix of independent entries, where each entry has a specific value and corresponds to a single row (or record) and a column.
1
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
| Yes | No | |
|---|---|---|
| 0 | 50 | 131 |
| 1 | 21 | 2 |
DataFrame entries don’t have to be numeric; the following is an example DataFrame with string values (user reviews).
1
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
| Bob | Sue | |
|---|---|---|
| 0 | I liked it. | Pretty good. |
| 1 | It was awful. | Bland. |
To create a DataFrame, use the pd.DataFrame() constructor and pass a Python dictionary. Put column names as keys and lists of values as dictionary values. This is the standard way to declare a new DataFrame.
When creating a DataFrame, you typically specify column labels (column names). If you don’t specify row labels, pandas assigns integers 0, 1, 2, … as row labels. If needed, you can set row labels manually. The list of row labels in a DataFrame is called the Index, and you can set it via the constructor’s index parameter.
1
2
3
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
| Bob | Sue | |
|---|---|---|
| Product A | I liked it. | Pretty good. |
| Product B | It was awful. | Bland. |
Series
A Series is a sequence of data values, i.e., a vector.
1
pd.Series([1, 2, 3, 4, 5])
A Series is essentially a single column of a DataFrame. As such, it can have an index, and instead of a “column name” it simply has a name.
1
pd.Series([30, 35, 40], index=['12015 Sales', '12016 Sales', '12017 Sales'], name='Product A')
1
2
3
4
12015 Sales 30
12016 Sales 35
12017 Sales 40
Name: Product A, dtype: int64
Series and DataFrame are closely related. You can think of a DataFrame as a collection of Series aligned by a shared index.
Reading data files
In many cases, you’ll import existing data rather than writing it from scratch. Data can be stored in various formats; the most basic is CSV. A CSV file typically looks like this:
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
CSV stands for “Comma-Separated Values,” i.e., a table where values are separated by commas.
To read CSV data into a DataFrame, use pd.read_csv().
1
product_reviews = pd.read_csv("../input/product-reviews/example-data.csv")
You can check a DataFrame’s shape with the shape attribute.
1
product_reviews.shape
1
(129971, 14)
This output means the DataFrame has 129,971 records (rows) and 14 columns.
Use the head() method to preview the first five rows.
1
product_reviews.head()
pd.read_csv() has over 30 parameters. For example, if the CSV already contains an index column, you can use index_col to tell pandas to use that column instead of creating a new integer index.
1
product_reviews = pd.read_csv("../input/product-reviews/example-data.csv", index_col=0)
Writing data files
You can export a DataFrame to CSV using the to_csv() method:
1
product_reviews.to_csv("../output/product-reviews/example-data.csv")
Lesson 2. Indexing, Selecting & Assigning
Selecting specific values from a pandas DataFrame or Series is a step you’ll perform in almost every data-processing task, so it’s essential to learn how to pick out the data points you need quickly and efficiently.
Native Python accessors
Native Python objects provide excellent indexing methods, and pandas adopts those same mechanisms.
Object attributes
In Python, you access an object’s property via its attribute name. For example, if example_obj has an attribute title, you can call example_obj.title. The same works for DataFrame columns.
1
reviews.country
1
2
3
4
5
6
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
Dictionary indexing
Python dictionaries use the indexing operator ([]) to access values. DataFrame columns can be accessed the same way.
1
reviews['country']
1
2
3
4
5
6
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
Both attribute access and dictionary-style access are valid; however, the dictionary style can handle column names containing reserved characters like spaces (e.g., reviews['country providence'] works, whereas reviews.country providence does not).
You can also index into the resulting pandas Series to retrieve an individual value:
1
reviews['country'][0]
1
'Italy'
Pandas-specific accessors
While attribute and [] accessors integrate naturally with the wider Python ecosystem, pandas also provides its own dedicated accessors: loc and iloc.
Index-based selection
Use iloc for index-based selection—that is, selecting by integer position.
For example, select the first row of the DataFrame:
1
reviews.iloc[0]
1
2
3
4
5
6
country Italy
description Aromas include tropical fruit, broom, brimston...
...
variety White Blend
winery Nicosia
Name: 0, Length: 13, dtype: object
Unlike native Python, where you select a column first and then rows, iloc selects rows first, then columns. Select the first column like this:
1
reviews.iloc[:, 0]
1
2
3
4
5
6
0 Italy
1 Portugal
...
129969 France
129970 France
Name: country, Length: 129971, dtype: object
Here : selects all rows, and 0 picks the first column. To select the second (1) and third (2) rows of the first column:
1
reviews.iloc[1:3, 0]
1
2
3
1 Portugal
2 US
Name: country, dtype: object
Or pass a list:
1
reviews.iloc[[1, 2], 0]
1
2
3
1 Portugal
2 US
Name: country, dtype: object
You can also use negative indices to select from the end. For example, the last five rows:
1
reviews.iloc[-5:]
Label-based selection
Alternatively, use loc for label-based selection—that is, selecting by index labels.
For example, to get the entry at row label 0 and column ‘country’:
1
reviews.loc[0, 'country']
1
'Italy'
iloc ignores the dataset’s index labels and treats the data as one big array, accessing by position. In contrast, loc uses the index information, which is often more intuitive because indexes usually carry meaning.
Range semantics: iloc vs loc
iloc follows Python’s standard half-open slicing, so 0:10 means 0 up to but not including 10 (i.e., 0,...,9).
loc treats slices as closed intervals, so 0:10 means 0 through 10 inclusive (i.e., 0,...,10).
The reason is that loc supports not just integers but any sortable label type. For example, suppose a DataFrame has labels like Apples, ..., Potatoes, .... Selecting alphabetically from ‘Apples’ to ‘Potatoes’ is more intuitive as df.loc['Apples':'Potatoes'] than something like “from ‘Apples’ up to (but not including) ‘Potatoet’.” For non-integer indices, closed intervals are typically more natural, hence loc uses them.
Otherwise, their behavior is broadly similar.
Personally, when working with ascending integer indices and slicing with
:, I preferilocto avoid confusion over slice semantics; in other cases, I findlocmore intuitive.
Manipulating the index
You can adjust the index as needed. For example, use set_index() to make a specific column the new index:
1
reviews.set_index("title")
Conditional selection
So far we’ve selected data using structural attributes of the DataFrame. You can also select rows that meet more complex logical conditions.
For example, suppose you have a DataFrame of wine reviews and you need wines from Italy with a score of at least 90.
1
reviews.country == 'Italy'
This condition returns a Series of boolean values:
1
2
3
4
5
6
0 True
1 False
...
129969 False
129970 False
Name: country, Length: 129971, dtype: bool
loc is label-based, but it also accepts a boolean array or an alignable boolean Series. Thus you can select only Italian wines like this:
1
reviews.loc[reviews.country == 'Italy']
Combine multiple conditions with & and |. Italian wines with scores ≥ 90:
1
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
Italian wines or wines with scores ≥ 90:
1
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
Pandas also provides helpful built-in selectors, notably isin and isna/notna.
isin returns a boolean mask indicating whether each value is in a given list. For example, select wines from Italy or France:
1
reviews.loc[reviews.country.isin(['Italy', 'France'])]
Use isna/notna to filter missing values (NaN). For example, select rows with non-missing prices:
1
reviews.loc[reviews.price.notna()]
Note: Although not mentioned in the original Kaggle course,
iloccan also take a boolean array. Unlikeloc, it supports arrays but not Series, so the kinds of alignable boolean masking shown above aren’t directly applicable.
Assigning data
You can add new data to a DataFrame or overwrite existing columns.
1
2
reviews['critic'] = 'everyone'
reviews['critic']
1
2
3
4
5
6
0 everyone
1 everyone
...
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
1
2
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']
1
2
3
4
5
6
0 129971
1 129970
...
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64
Lesson 3. Summary Functions and Maps
Quick summaries
The describe() method provides a high-level summary of a given column.
1
reviews.points.describe()
1
2
3
4
5
6
count 129971.000000
mean 88.447138
...
75% 91.000000
max 100.000000
Name: points, Length: 8, dtype: float64
The output of describe() depends on the data type. For non-numeric (string) data:
1
reviews.taster_name.describe()
1
2
3
4
5
count 103727
unique 19
top Roger Voss
freq 25514
Name: taster_name, dtype: object
You can also compute specific statistics directly:
1
reviews.points.mean()
1
88.44713820775404
1
reviews.taster_name.unique()
1
2
3
4
5
6
7
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
'Christina Pickard'], dtype=object)
To count occurrences of unique values, use value_counts():
1
reviews.taster_name.value_counts()
1
2
3
4
5
6
Roger Voss 25514
Michael Schachner 15134
...
Fiona Adams 27
Christina Pickard 6
Name: taster_name, Length: 19, dtype: int64
Maps
A map is a function that transforms elements from one set to another. In data science, we often need to transform data into other representations; maps are essential for such tasks.
Two methods are used frequently.
Series.map() takes a function that converts a single value to another single value, applies it to every value in the Series, and returns a new Series. For example, to subtract the mean from each wine score:
1
2
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
1
2
3
4
5
6
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
DataFrame.apply() applies a custom function to each row (or column) of a DataFrame.
1
2
3
4
5
def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis='columns')
Call apply() with axis='index' to apply a function column-wise instead of row-wise.
Both Series.map() and DataFrame.apply() return new transformed objects and do not modify the original data.
| Method | Series.map() | DataFrame.apply() |
|---|---|---|
| Target | Series | DataFrame |
| Granularity | Apply per value (if you regard a Series as a column vector, this is row-wise) | Row-wise by default Can be column-wise with an option |
Note that
Series.apply()andDataFrame.map()also exist.
Series.apply():
by_row='compat'(default): behaves likeSeries.map()by_row=False: passes the entire Series to the function at once (similar toDataFrame.apply()withaxis='index')DataFrame.map(): applies a function to each individual value in the DataFrame (analogous toSeries.map()but for DataFrames)
Pandas also provides many common vectorized transformations natively. The example above can be written much more simply, and pandas will still infer and perform the intended operation:
1
2
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean
1
2
3
4
5
6
0 -1.447138
1 -1.447138
...
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
Pandas supports operations between Series of the same length, too. In the wine example, you can concatenate strings across two columns:
1
reviews.country + " - " + reviews.region_1
1
2
3
4
5
6
0 Italy - Etna
1 NaN
...
129969 France - Alsace
129970 France - Alsace
Length: 129971, dtype: object
These vectorized operations use pandas’ internal acceleration and are faster than map() or apply(). Still, map() and apply() are more flexible and can handle more complex transformations, so they’re good to know.

