
Get a list from Pandas DataFrame column headers

big-blog 2020. 9. 28. 09:30



I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.

For example, if I'm given a DataFrame like this:

>>> my_dataframe
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7

I would like to get a list like this:

>>> header_list
['y', 'gdp', 'cap']

You can get the values as a list by doing:

list(my_dataframe.columns.values)

You can also simply use (as shown in Ed Chum's answer):

list(my_dataframe)

There is a built-in method that is the most performant:

my_dataframe.columns.values.tolist()

.columns returns an Index, .columns.values returns an array, and the array has a helper function, .tolist, that returns a list.
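The three steps in that chain can be inspected one at a time. The following is a minimal sketch, assuming a small frame that mirrors the first rows of the example in the question; the variable names columns, values and header_list are illustrative only.

```python
import pandas as pd

# Small frame mirroring the first three rows of the example above.
my_dataframe = pd.DataFrame({"y": [1, 2, 8], "gdp": [2, 3, 7], "cap": [5, 9, 2]})

columns = my_dataframe.columns   # a pandas Index holding the labels
values = columns.values          # the labels as an array (a NumPy array on
                                 # the pandas versions discussed here)
header_list = values.tolist()    # a plain Python list

# Print the type at each step, then the final list.
print(type(columns).__name__)
print(type(values).__name__)
print(header_list)
```

Only the last object is a built-in list; the first two are pandas and array objects that merely behave like sequences.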

If performance is not that important to you, Index objects define a .tolist() method that you can call directly:

my_dataframe.columns.tolist()

The difference in performance is obvious:

%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
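Outside IPython, a similar comparison can be run with the standard-library timeit module. This is only a sketch: the frame, the candidates dictionary and the repeat count are assumptions, and the absolute numbers depend on the machine and the pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"y": [1, 2, 8], "gdp": [2, 3, 7], "cap": [5, 9, 2]})

# Each candidate is wrapped in a lambda so timeit can call it repeatedly.
candidates = {
    "df.columns.tolist()": lambda: df.columns.tolist(),
    "df.columns.values.tolist()": lambda: df.columns.values.tolist(),
    "list(df)": lambda: list(df),
}

timings = {}
for label, func in candidates.items():
    # Total seconds for 10,000 calls of this candidate.
    timings[label] = timeit.timeit(func, number=10_000)
    print(f"{label:30s} {timings[label]:.4f} s")

# Whatever the timings, every candidate should produce the same list.
results = [func() for func in candidates.values()]
```

Checking that all candidates return the same list before comparing their speed guards against benchmarking something that is not actually equivalent.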

For those who hate typing, you can just call list on df, like so:

list(df)

I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:

In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop

In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop

In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop

In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop

(I still really like list(dataframe) though, so thanks EdChum!)


It gets even simpler (as of pandas 0.16.0):

df.columns.tolist()

This will give you the column names in a nice list.


>>> list(my_dataframe)
['y', 'gdp', 'cap']

To list the columns of a data frame while in debugger mode, use a list comprehension:

>>> [c for c in my_dataframe]
['y', 'gdp', 'cap']

By the way, you can get a sorted list simply by using sorted:

>>> sorted(my_dataframe)
['cap', 'gdp', 'y']

That is available as my_dataframe.columns.


It's interesting, but df.columns.values.tolist() is almost 3 times faster than df.columns.tolist(), even though I thought they were the same:

In [97]: %timeit df.columns.values.tolist()
100000 loops, best of 3: 2.97 µs per loop

In [98]: %timeit df.columns.tolist()
10000 loops, best of 3: 9.67 µs per loop

I'm surprised I haven't seen this posted so far, so I'll just leave this here.

Extended Iterable Unpacking (Python 3.5+): [*df] and Friends

Unpacking generalizations (PEP 448) have been introduced with Python 3.5. So, the following operations are all possible.

df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df

   A  B  C
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x 

If you want a list....

[*df]
# ['A', 'B', 'C']

Or, if you want a set,

{*df}
# {'A', 'B', 'C'}

Or, if you want a tuple,

*df,  # Please note the trailing comma
# ('A', 'B', 'C')

Or, if you want to store the result somewhere,

*cols, = df  # A wild comma appears, again
cols
# ['A', 'B', 'C']

... if you're the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)

P.S.: if performance is important, you will want to ditch the solutions above in favour of

df.columns.to_numpy().tolist()
# ['A', 'B', 'C']

This is similar to Ed Chum's answer, but updated for v0.24 where .to_numpy() is preferred to the use of .values. See this answer (by me) for more information.
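As a quick check that the unpacking forms and the .to_numpy() route agree, here is a minimal sketch using the same frame as above. The variable names are illustrative, and .to_numpy() assumes pandas 0.24 or later, as the answer notes.

```python
import pandas as pd

df = pd.DataFrame("x", columns=["A", "B", "C"], index=range(5))

as_list = [*df]      # list of column labels
as_set = {*df}       # set of column labels (unordered)
as_tuple = (*df,)    # tuple of column labels; note the trailing comma
*cols, = df          # starred assignment always binds a list

# The route suggested when performance matters.
via_to_numpy = df.columns.to_numpy().tolist()

print(as_list == cols == via_to_numpy)
```

All of these iterate over the column labels only, so the results differ in container type but not in content.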

Visual Check
Since I've seen this discussed in other answers, you can utilise iterable unpacking (no need for explicit loops).

print(*df)
A B C

print(*df, sep='\n')
A
B
C

Critique of Other Methods

Don't use an explicit for loop for an operation that can be done in a single line (List comprehensions are okay).

Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.

Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.

Lastly, listification i.e., list(df) should only be used as a concise alternative to the aforementioned methods.
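The ordering point in the critique is easy to verify. A minimal sketch, assuming a frame whose columns are deliberately not in alphabetical order:

```python
import pandas as pd

# Columns are declared in the order y, gdp, cap.
df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

original_order = df.columns.tolist()   # keeps the frame's column order
alphabetical = sorted(df)              # reorders the labels alphabetically

print(original_order)   # ['y', 'gdp', 'cap']
print(alphabetical)     # ['cap', 'gdp', 'y']
```

If later code relies on positional correspondence between the label list and the frame (for example when zipping labels with row values), only the first form is safe.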


A DataFrame follows the dict-like convention of iterating over the “keys” of the objects.

my_dataframe.keys()

Create a list of keys/columns, using either the object method to_list() or the Pythonic way:

my_dataframe.keys().to_list()
list(my_dataframe.keys())

Basic iteration on a DataFrame returns column labels

[column for column in my_dataframe]

Do not convert a DataFrame into a list, just to get the column labels. Do not stop thinking while looking for convenient code samples.

xlarge = pd.DataFrame(np.arange(100000000).reshape(10000,10000))
list(xlarge) #compute time and memory consumption depend on dataframe size - O(N)
list(xlarge.keys()) #constant time operation - O(1)
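For a frame small enough to inspect by eye, the relationship between keys() and columns can be checked directly. This is a sketch on an assumed three-column frame; to_list() assumes a pandas version that provides it (0.24 or later).

```python
import pandas as pd

df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

keys = df.keys()

# keys() holds the same labels, in the same order, as the columns Index.
same_labels = keys.equals(df.columns)

key_list = df.keys().to_list()   # object-method spelling
py_list = list(df.keys())        # built-in spelling

print(same_labels, key_list == py_list)
```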

In the Notebook

For data exploration in the IPython notebook, my preferred way is this:

sorted(df)

Which will produce an easy to read alphabetically ordered list.

In a code repository

In code I find it more explicit to do

df.columns

Because it tells others reading your code what you are doing.


%%timeit
final_df.columns.values.tolist()
948 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
list(final_df.columns)
14.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.columns.values)
1.88 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
final_df.columns.tolist()
12.3 µs ± 27.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.head(1).columns)
163 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As answered by Simeon Visser, you could do

list(my_dataframe.columns.values) 

or

list(my_dataframe) # for less typing.

But I think the sweet spot is:

list(my_dataframe.columns)

It is explicit, at the same time not unnecessarily long.


For a quick, neat, visual check, try this:

for col in df.columns:
    print(col)

This gives us the names of columns in a list:

list(my_dataframe.columns)

Another function called tolist() can be used too:

my_dataframe.columns.tolist()

I feel the question deserves additional explanation.

As @fixxxer noted, the answer depends on the pandas version you are using in your project, which you can get with pd.__version__.

If, like me, you are for some reason using a version of pandas older than 0.16.0 (on Debian jessie I use 0.14.1), then you need to use:

df.keys().tolist()

because there is no df.columns method implemented yet.

The advantage of this keys method is that it also works in newer versions of pandas, so it is more universal.
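A minimal sketch of that approach, assuming a small example frame. It prints the installed version first, since behaviour on very old releases such as 0.14.1 cannot be confirmed here.

```python
import pandas as pd

# Version string of the installed pandas, e.g. for bug reports or debugging.
print(pd.__version__)

df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

# keys() returns the column labels; tolist() turns them into a Python list.
headers = df.keys().tolist()
print(headers)
```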


# Collect the column labels one by one with an explicit loop.
n = []
for i in my_dataframe.columns:
    n.append(i)
print(n)

The solution provided above is nice. Still, I would expect something like frame.column_names() to exist as a function in pandas. Since it does not, it may be nice to use the following syntax, which somehow preserves the feeling that you are using pandas in a proper way by calling the tolist function:

frame.columns.tolist() 

This solution lists all the columns of your object my_dataframe:

print(list(my_dataframe))

Reference URL: https://stackoverflow.com/questions/19482970/get-list-from-pandas-dataframe-column-headers
