Quick answer
Python tried to read the bytes as UTF-8, but they are not valid UTF-8. The file is in a different encoding (often Windows-1252/Latin-1 from Excel), is UTF-16 (which starts with a byte-order mark), or is binary. Open it with the correct encoding=, such as cp1252 or utf-16, or in binary mode ('rb') if it isn't text. For a UTF-8 file that starts with a BOM, use utf-8-sig.
The exact error string
with open('export.csv') as f:  # no encoding= given: platform default, usually UTF-8
    data = f.read()
Traceback (most recent call last):
  File "app.py", line 2, in <module>
    data = f.read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
The same error appears with other bytes, such as 0x80, 0x92 (a Windows "smart quote"), or 0xe9 (é in Latin-1), and at other positions. The byte and position are the clue to which cause you have. The headline case, 0xff at position 0, is almost always a UTF-16 byte-order mark, so jump to Cause 2 (BOM); a bad byte mid-file points instead to a wrong encoding such as Windows-1252 (Cause 1).
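The byte and position can also be read from the exception object instead of the traceback text. A minimal sketch, assuming a hypothetical helper named describe_decode_error (not part of any library); start and reason are standard attributes of UnicodeDecodeError:

```python
def describe_decode_error(raw: bytes, encoding: str = "utf-8"):
    """Return (position, byte value, reason) for the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec rejected
        return exc.start, raw[exc.start], exc.reason
    return None


# UTF-16 little-endian BOM followed by "hi": fails at position 0
print(describe_decode_error(b"\xff\xfeh\x00i\x00"))

# A cp1252 curly apostrophe inside ASCII text: fails mid-string
print(describe_decode_error(b"it\x92s"))
```

To apply it to a file, read the file in binary mode ('rb') and pass the bytes in.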
Why UTF-8 decoding fails
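In text mode, open() without encoding= uses the platform's default encoding, which is UTF-8 on most Linux and macOS systems (and on Windows when Python's UTF-8 mode is enabled). UTF-8 has strict rules about which byte values can start a character and which can follow, so a file in another encoding that contains any non-ASCII byte will almost certainly hit a sequence that is illegal in UTF-8 and raise UnicodeDecodeError. The decoder is correct; the assumption that the file is UTF-8 is wrong.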
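A small illustration of the mismatch, using the character é as an assumed example: UTF-8 and Latin-1 store it as different bytes, and the Latin-1 byte on its own is not a complete UTF-8 sequence.

```python
text = "é"
utf8_bytes = text.encode("utf-8")      # two bytes: 0xc3 0xa9
latin1_bytes = text.encode("latin-1")  # one byte: 0xe9

print(utf8_bytes, latin1_bytes)

# 0xe9 is a UTF-8 lead byte that announces a three-byte sequence,
# so decoding it on its own as UTF-8 raises UnicodeDecodeError.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc.reason)
```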
Cause 1: the file is Windows-1252 / Latin-1
Spreadsheets exported from Excel, and a lot of legacy data, are Windows-1252 (a.k.a. cp1252) or ISO-8859-1 (Latin-1). Tell Python the real encoding:
# Western European text from Excel:
with open('export.csv', encoding='cp1252') as f:
    data = f.read()
# Latin-1 never raises (it maps all 256 byte values), which is handy, but it
# will produce WRONG characters if the file is actually some other encoding:
with open('export.csv', encoding='latin-1') as f:
    data = f.read()
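The difference between the two codecs shows up in the 0x80 to 0x9f range, where Windows-1252 places printable characters such as curly quotes and Latin-1 places control characters. A short sketch with an assumed sample string:

```python
raw = b"it\x92s"  # "it's" with a curly apostrophe, as saved by a Windows-1252 application

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))  # the apostrophe is U+2019, RIGHT SINGLE QUOTATION MARK
print(repr(as_latin1))  # the apostrophe became U+0092, an invisible control character
```

Neither call raises, which is why a Latin-1 read that "works" still deserves a look at the decoded text.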
Cause 2: a byte-order mark (BOM)
If the failing byte is at position 0, suspect a BOM. 0xff 0xfe means the file is UTF-16; 0xef 0xbb 0xbf is a UTF-8 BOM that the plain UTF-8 codec leaves as a stray character:
# 0xff 0xfe at position 0 -> the file is UTF-16
with open('data.txt', encoding='utf-16') as f:
    data = f.read()
# 0xef 0xbb 0xbf -> UTF-8 with BOM; utf-8-sig strips it
with open('data.csv', encoding='utf-8-sig') as f:
    data = f.read()
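The BOM check can be automated by comparing the first bytes of the file against the constants in the standard codecs module. A minimal sketch, assuming a hypothetical helper named sniff_bom; it covers only the UTF-8 and UTF-16 marks discussed here and does not try to recognise UTF-32:

```python
import codecs

# (signature, encoding name to pass to open()) pairs
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),    # 0xef 0xbb 0xbf
    (codecs.BOM_UTF16_LE, "utf-16"),   # 0xff 0xfe
    (codecs.BOM_UTF16_BE, "utf-16"),   # 0xfe 0xff
]


def sniff_bom(head: bytes):
    """Return an encoding name if `head` starts with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return None


print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(sniff_bom(b"\xef\xbb\xbfname,age"))
print(sniff_bom(b"name,age"))
```

In practice, head would come from open(path, 'rb').read(4), and a None result means the file has no BOM, so one of the other causes applies.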
Cause 3: reading a CSV with pandas
The same fix applies to pandas.read_csv, which also defaults to UTF-8: pass the encoding the file actually uses. This is a very common place for the error to show up, because CSV files exported from Excel are often not UTF-8.
import pandas as pd
df = pd.read_csv('export.csv', encoding='cp1252') # or 'latin-1', 'utf-16'
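When several encodings are plausible, one option is to test candidates against the raw bytes before handing the file to pandas. This is a sketch under assumptions, not a pandas feature: the helper name first_strict_encoding and the candidate order are illustrative, and because latin-1 accepts any byte it only makes sense as the last entry.

```python
def first_strict_encoding(raw: bytes, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


print(first_strict_encoding("café".encode("utf-8")))  # valid UTF-8 is accepted first
print(first_strict_encoding(b"caf\xe9"))              # not UTF-8, so the next candidate is tried
```

The result can then be passed on, for example pd.read_csv('export.csv', encoding=first_strict_encoding(raw)), where raw is the file content read with open('export.csv', 'rb'). A successful decode only proves the bytes are legal in that encoding, not that the characters are the intended ones.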
Need the CSV as JSON for an API or config? Once it loads cleanly, convert it in your browser with CSV to JSON — no upload, no server.
Cause 4: the file is binary, not text
If the data is an image, a zip, a Parquet file, or a pickle, it should never be decoded as text. Open it in binary mode and work with bytes:
with open('photo.png', 'rb') as f:  # 'rb' = read binary, no decoding
    raw = f.read()                  # raw is bytes, not str
# To move binary safely through JSON/text, base64-encode it:
import base64
b64 = base64.b64encode(raw).decode('ascii')
To eyeball or sanity-check a base64 value by hand, paste it into the Base64 encoder / decoder. Storing bytes in JSON this way is also the standard answer to the related Python error Object of type bytes is not JSON serializable.
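The round trip through JSON can be sketched as follows. The sample bytes are the 8-byte PNG signature standing in for real file contents, and the "image" key is an arbitrary name chosen for the example:

```python
import base64
import json

raw = b"\x89PNG\r\n\x1a\n"  # PNG signature, used here as stand-in binary data

# Passing bytes straight to json.dumps raises TypeError
try:
    json.dumps({"image": raw})
except TypeError as exc:
    print(exc)

# Encode to base64 text on the way in...
payload = json.dumps({"image": base64.b64encode(raw).decode("ascii")})

# ...and decode back to the original bytes on the way out
restored = base64.b64decode(json.loads(payload)["image"])
print(restored == raw)
```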
Detecting an unknown encoding
When you genuinely don't know the encoding, detect it — then verify the result, because detection is a heuristic:
# pip install charset-normalizer
from charset_normalizer import from_path
best = from_path('mystery.csv').best()  # None if no plausible encoding was found
if best is not None:
    print(best.encoding)  # e.g. 'cp1252'
    text = str(best)      # decoded with the detected encoding
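Part of the verification step can be automated with a rough check on the decoded text. The helper below is a hypothetical heuristic, not part of charset-normalizer: it flags the replacement character U+FFFD and the C1 control range U+0080 to U+009F, which rarely appear in real text and often indicate Windows-1252 data decoded as Latin-1.

```python
def looks_suspicious(text: str) -> bool:
    """Heuristic: True if text contains U+FFFD or a C1 control character."""
    return any(ch == "\ufffd" or "\x80" <= ch <= "\x9f" for ch in text)


raw = b"it\x92s"  # cp1252 bytes containing a curly apostrophe

print(looks_suspicious(raw.decode("latin-1")))  # wrong codec leaves a C1 control character
print(looks_suspicious(raw.decode("cp1252")))   # right codec yields ordinary punctuation
```

A clean result does not prove the encoding is right, so reading a sample of the decoded text by eye is still worthwhile.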
A note on errors='ignore' / 'replace'
You will see advice to add errors='replace' or errors='ignore'. These make the exception go away by substituting or dropping the bad bytes — which silently corrupts or loses data. Use them only for best-effort display of already-broken input; for data you care about, find the correct encoding instead.
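The data loss is easy to see on a small sample. A sketch with an assumed Latin-1 byte string:

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes, not valid UTF-8

replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
correct = raw.decode("latin-1")                   # the real encoding keeps the é

print(repr(replaced))
print(repr(ignored))
print(repr(correct))
```

Both error handlers return a string without complaint, but only the decode with the real encoding preserves the original character.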
Debugging checklist
- ✓ Read the failing byte and position: position 0 usually means a BOM
- ✓ 0xff 0xfe at start → encoding='utf-16'; 0xef 0xbb 0xbf → 'utf-8-sig'
- ✓ Excel/legacy text → try encoding='cp1252' (or 'latin-1' as a never-fail fallback)
- ✓ Using pandas? Pass encoding= to read_csv
- ✓ Binary file? Open with 'rb' and don't decode; base64 it to put in JSON
- ✓ Unknown? Detect with charset-normalizer, then verify the text
- ✓ Avoid errors='ignore'/'replace' for data you need intact
Frequently Asked Questions
What does 'utf-8 codec can't decode byte 0xff' mean?
It means Python tried to read bytes as UTF-8 text, but those bytes are not valid UTF-8. The byte 0xff (or 0x80, 0x92, etc.) does not start a legal UTF-8 sequence, so decoding stops. The file is almost certainly in a different encoding, or it is binary data, not text.
How do I fix it when reading a file?
Open the file with the encoding it actually uses, not the default UTF-8. For Western European text from Excel that is usually open(path, encoding='cp1252') or encoding='latin-1'. Latin-1 never fails because it maps all 256 byte values, but it can produce wrong characters if the file is really another encoding.
What is 0xff or 0xfe at the start of the file?
Bytes 0xff 0xfe (or 0xfe 0xff) at position 0 are a UTF-16 byte-order mark. The file is UTF-16, not UTF-8. Read it with encoding='utf-16'. If you see 0xef 0xbb 0xbf, that is a UTF-8 BOM — use encoding='utf-8-sig' to strip it.
How do I read a CSV that throws UnicodeDecodeError?
CSV exported from Excel is often Windows-1252, not UTF-8. With pandas pass the encoding: pd.read_csv(path, encoding='cp1252') or encoding='latin-1'. If unsure, detect it first with charset-normalizer, or re-export the file as UTF-8 from the source application.
Should I use errors='ignore' or errors='replace'?
Only as a last resort. open(path, encoding='utf-8', errors='replace') replaces undecodable bytes with a placeholder and errors='ignore' drops them — both silently corrupt or lose data. Prefer finding the correct encoding; use these flags only for best-effort display of partly broken input.
How do I detect the encoding automatically?
Use the charset-normalizer or chardet library: from charset_normalizer import from_path; best = from_path('file.csv').best(); print(best.encoding). Detection is a heuristic, not a guarantee, so treat the result as a strong hint and verify the decoded text looks right.