import os
import re
import argparse
from collections import defaultdict

"""
This script counts the Markdown files in a fixed directory that contain a date line for a given year,
sums their character counts, and tallies the tags they declare.

Usage:
python countNumber.py <year>

Arguments:
year (int): The year to search for in the date pattern (e.g., 2023).

Example:
python countNumber.py 2023

Description:
- The script searches for Markdown (.md) files directly inside folder_path; subfolders are skipped.
- It looks for text that matches the pattern "date: yyyy-xx-xx" where yyyy is the specified year.
- It counts the files that match and adds up len(content) for each, i.e. the number of characters.
- It counts how often each tag appears in the "tags:" lists of the matching files.
- If the specified directory does not exist, it prints an error message.
- If there is a permission error or if the specified path is a directory, it prints an appropriate error message.

Variables:
folder_path (str): The path to the folder containing the Markdown files.
date_pattern (str): The regular expression pattern to match the date in the specified year.
tag_pattern (re.Pattern): The compiled regular expression that captures the items of a "tags:" list.
count (int): The number of files that match the date pattern.
total_word_count (int): The total character count of the files that match the date pattern.
tag_count (defaultdict): The number of occurrences of each tag in the matching files.

Exceptions:
- FileNotFoundError: If the specified directory does not exist.
"""

# Set up command-line argument parsing
parser = argparse.ArgumentParser(description="Count Markdown files with a specific date pattern and their word count.")
parser.add_argument("year", type=int, help="The year to search for in the date pattern (e.g., 2023).")
args = parser.parse_args()

folder_path = "source/_posts"  # Replace with your folder path
date_pattern = rf"date: {args.year}-\d{{2}}-\d{{2}}"  # Matches "date: yyyy-xx-xx" for the given year
tag_pattern = re.compile(r'tags:\s*\n((?:\s*-\s[^\-].*\n)+)')  # Refined tag pattern: only matches items with a single leading "-"

count = 0
total_word_count = 0
tag_count = defaultdict(int)

try:
    for entry in os.scandir(folder_path):
        # Only regular files ending in .md are processed; subfolders are skipped.
        if entry.is_file() and entry.name.endswith(".md"):
            try:
                with open(entry.path, "r", encoding="utf-8") as file:
                    content = file.read()
                if re.search(date_pattern, content):
                    count += 1
                    # len(content) counts characters, not whitespace-separated words.
                    total_word_count += len(content)

                    # Count the tags
                    matches = tag_pattern.findall(content)
                    for match in matches:
                        tags = match.strip().split('\n')
                        for tag in tags:
                            tag_name = tag.strip('- ').strip()
                            tag_count[tag_name] += 1
            except (OSError, UnicodeDecodeError) as e:
                print(f"Error reading file {entry.name}: {e}")
except FileNotFoundError:
    print(f"Error: The directory '{folder_path}' does not exist.")
except PermissionError:
    print(f"Error: You do not have permission to access '{folder_path}'.")
except IsADirectoryError:
    print(f"Error: '{folder_path}' is a directory, not a file.")

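print(f"Number of files in {args.year}: {count}")
print(f"Total character count of files in {args.year}: {total_word_count}")
print("Tag statistics:")
# Use a separate loop variable so the file counter `count` is not overwritten.
for tag, tag_total in tag_count.items():
    print(f'{tag}: {tag_total}')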

Explanation:

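The code uses os.scandir() to iterate over every entry in the folder and calls entry.is_file() to check whether each entry is a file. If it is, the code then checks whether the file name ends with .md, and only those files are processed. Folders are skipped, so files inside subfolders are never read.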
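The filtering step can be tried on its own. The following is a minimal sketch, not part of the original script: the helper name `list_markdown_files` and the temporary directory layout are made up for illustration. It shows that a folder whose name ends in .md and a Markdown file inside a subfolder are both left out.

```python
import os
import tempfile


def list_markdown_files(folder_path):
    """Return the sorted names of .md files directly inside folder_path."""
    return sorted(
        entry.name
        for entry in os.scandir(folder_path)
        if entry.is_file() and entry.name.endswith(".md")
    )


# Build a throwaway directory to show which entries are kept.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "post.md"), "w").close()
    open(os.path.join(tmp, "notes.txt"), "w").close()
    os.mkdir(os.path.join(tmp, "drafts.md"))  # a folder, rejected by is_file()
    os.mkdir(os.path.join(tmp, "nested"))
    open(os.path.join(tmp, "nested", "deep.md"), "w").close()  # never visited
    found = list_markdown_files(tmp)

print(found)  # ['post.md']
```

If posts in subfolders should be included as well, os.walk() would be the usual alternative to os.scandir().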

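For each .md file that matches the year, the code adds the file's size in characters to a running total and counts each tag in the file. The steps are:

  1. Compute the length of the file content with len(content), which is a character count rather than a word count, and add it to total_word_count.
  2. Find all tag lists in the file with the regular expression call tag_pattern.findall(content).
  3. For each tag found, strip the leading dash and surrounding whitespace, then increment that tag's counter in the tag_count dictionary.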
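The steps above can be checked against a small in-memory example. This is a sketch under assumptions: the front matter in `sample` is invented for illustration, and only `tag_pattern` and the stripping logic are taken from the script.

```python
import re
from collections import defaultdict

# Same pattern as the script: a "tags:" line followed by "- item" lines.
tag_pattern = re.compile(r'tags:\s*\n((?:\s*-\s[^\-].*\n)+)')

# Hypothetical front matter, made up for this example.
sample = (
    "---\n"
    "title: Example\n"
    "date: 2023-05-01\n"
    "tags:\n"
    "  - python\n"
    "  - regex\n"
    "---\n"
    "Body text.\n"
)

tag_count = defaultdict(int)
for block in tag_pattern.findall(sample):
    # Each block is the captured run of "- item" lines under "tags:".
    for line in block.strip().split("\n"):
        tag_count[line.strip("- ").strip()] += 1

char_total = len(sample)  # characters, not whitespace-separated words

print(dict(tag_count))  # {'python': 1, 'regex': 1}
```

The closing `---` line is not mistaken for a tag because the pattern requires whitespace after a single dash.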

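If an exception occurs while reading a file, the code catches it and prints the error message. Errors raised when accessing the folder itself (FileNotFoundError, PermissionError, IsADirectoryError) are handled separately by the outer try block.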
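The folder-level handling can be exercised in isolation. The sketch below is an illustration rather than the script's own code: `scan_status` is a hypothetical helper, and it catches NotADirectoryError, which is the exception os.scandir() raises when it is given the path of a regular file. The script itself lists IsADirectoryError for that position.

```python
import os
import tempfile


def scan_status(folder_path):
    """Return a short label describing how os.scandir(folder_path) behaves."""
    try:
        with os.scandir(folder_path) as entries:
            return f"ok: {sum(1 for _ in entries)} entries"
    except FileNotFoundError:
        return "missing"
    except NotADirectoryError:
        # Raised when folder_path points at a regular file.
        return "not a directory"
    except PermissionError:
        return "permission denied"


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "post.md")
    open(file_path, "w").close()
    results = {
        "existing folder": scan_status(tmp),
        "missing folder": scan_status(os.path.join(tmp, "no_such_dir")),
        "regular file": scan_status(file_path),
    }

for case, status in results.items():
    print(f"{case}: {status}")
```

Returning a label instead of printing inside the handlers keeps the behavior easy to test; the script's approach of printing directly is equally valid for a one-off command-line tool.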

Finally, the code prints the number of files for the specified year, their total character count, and the count for each tag.
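The year filter that decides which files are counted can also be tested separately. This is a minimal sketch with a hypothetical helper, `has_post_date`. Unlike the script's unanchored search, it anchors the pattern to the start of a line with `^` and re.MULTILINE, which is an assumption about the intended behavior and not something the script does.

```python
import re


def has_post_date(content, year):
    """Return True if content has a line starting with 'date: <year>-MM-DD'."""
    # ^ with re.MULTILINE matches at the start of any line, so keys such as
    # "updated_date:" no longer count as a match.
    pattern = rf"^date: {year}-\d{{2}}-\d{{2}}"
    return re.search(pattern, content, flags=re.MULTILINE) is not None


post = "title: Example\ndate: 2023-05-01\n"
other = "title: Example\nupdated_date: 2023-05-01\n"

print(has_post_date(post, 2023))   # True
print(has_post_date(post, 2022))   # False
print(has_post_date(other, 2023))  # False
```

With the script's original unanchored pattern, the third case would match, because the text `date: 2023-05-01` appears inside the `updated_date` line.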